An In-Depth Analysis of r/UMD

image.png

By Matt Graber, Tim Henderson, Matt Vorsteg, and Jordan Woo

r/UMD is the official subreddit (sub-community of the popular social media news aggregation website Reddit) for the University of Maryland, College Park. Simply by looking at the front page of r/UMD, we can see that the community was first created on April 15, 2010, and there are 20,789 Reddit users who have joined it. For this data analysis project, we'll be digging deeper, analyzing the posts, comments, and the users of r/UMD themselves to find meaningful insights about the subreddit.

Note: throughout this Jupyter Notebook, all of our plots will be created with the Plotly Python Open Source Graphing Library. We have chosen to use this library to create our plots because it allows each plot to be interactive. Simply hover your cursor over any portion of the graphic to view the data at that point, and click and drag within the plot to zoom in on specific portions. To zoom out, double-click within the plot.

In [6]:
# Here are are the installations/imports that we will be using throughout this project.
# Their uses will be made apparent as we utilize them.
!pip install nltk
!pip install praw
!pip install psaw
!pip install plotly
!pip install vaderSentiment
import datetime as dt
from datetime import timedelta, datetime
import time
import praw
import sqlite3
from sqlite3 import Error
import pandas as pd
from psaw import PushshiftAPI
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.stats.proportion as smp
import statsmodels.formula.api as smf
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from random import randint
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from math import log
Collecting nltk
  Downloading https://files.pythonhosted.org/packages/f6/1d/d925cfb4f324ede997f6d47bea4d9babba51b49e87a767c170b77005889d/nltk-3.4.5.zip (1.5MB)
     |████████████████████████████████| 1.5MB 3.4MB/s eta 0:00:01
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from nltk) (1.13.0)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4.5-cp37-none-any.whl size=1449906 sha256=941b5cd35142bee211ff462748e67e509d3a455e1c5b0b260e668e84f963417a
  Stored in directory: /home/jovyan/.cache/pip/wheels/96/86/f6/68ab24c23f207c0077381a5e3904b2815136b879538a24b483
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5
Collecting praw
  Downloading https://files.pythonhosted.org/packages/f6/df/b42c0a3b86a43a62a46e5b2f07930230ac7719624800a2052218993fb767/praw-6.4.0-py2.py3-none-any.whl (126kB)
     |████████████████████████████████| 133kB 2.6MB/s eta 0:00:01
Collecting websocket-client>=0.54.0
  Downloading https://files.pythonhosted.org/packages/29/19/44753eab1fdb50770ac69605527e8859468f3c0fd7dc5a76dd9c4dbd7906/websocket_client-0.56.0-py2.py3-none-any.whl (200kB)
     |████████████████████████████████| 204kB 4.0MB/s eta 0:00:01
Collecting prawcore<2.0,>=1.0.1
  Downloading https://files.pythonhosted.org/packages/76/b5/ce6282dea45cba6f08a30e25d18e0f3d33277e2c9fcbda75644b8dc0089b/prawcore-1.0.1-py2.py3-none-any.whl
Collecting update-checker>=0.16
  Downloading https://files.pythonhosted.org/packages/17/c9/ab11855af164d03be0ff4fddd4c46a5bd44799a9ecc1770e01a669c21168/update_checker-0.16-py2.py3-none-any.whl
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from websocket-client>=0.54.0->praw) (1.13.0)
Requirement already satisfied: requests<3.0,>=2.6.0 in /opt/conda/lib/python3.7/site-packages (from prawcore<2.0,>=1.0.1->praw) (2.22.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests<3.0,>=2.6.0->prawcore<2.0,>=1.0.1->praw) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests<3.0,>=2.6.0->prawcore<2.0,>=1.0.1->praw) (1.25.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests<3.0,>=2.6.0->prawcore<2.0,>=1.0.1->praw) (2019.9.11)
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests<3.0,>=2.6.0->prawcore<2.0,>=1.0.1->praw) (2.8)
Installing collected packages: websocket-client, prawcore, update-checker, praw
Successfully installed praw-6.4.0 prawcore-1.0.1 update-checker-0.16 websocket-client-0.56.0
Collecting psaw
  Downloading https://files.pythonhosted.org/packages/60/b7/6724defc12bdcc45470e2b1fc1b978367f3d183ec6c6baa2770a0b083fc7/psaw-0.0.7-py3-none-any.whl
Requirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from psaw) (2.22.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests->psaw) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->psaw) (2019.9.11)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->psaw) (1.25.6)
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->psaw) (2.8)
Installing collected packages: psaw
Successfully installed psaw-0.0.7
Collecting plotly
  Downloading https://files.pythonhosted.org/packages/8e/ce/6ea5683c47b682bffad39ad41d10913141b560b1b875a90dbc6abe3f4fa9/plotly-4.4.1-py2.py3-none-any.whl (7.3MB)
     |████████████████████████████████| 7.3MB 4.1MB/s eta 0:00:01     |███████████▏                    | 2.5MB 2.7MB/s eta 0:00:02
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from plotly) (1.13.0)
Collecting retrying>=1.3.3
  Downloading https://files.pythonhosted.org/packages/44/ef/beae4b4ef80902f22e3af073397f079c96969c69b2c7d52a57ea9ae61c9d/retrying-1.3.3.tar.gz
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-cp37-none-any.whl size=11429 sha256=883d3189d63f1a66ef03cf6b7eb2b185d4c3257a990962f5e7e4d9867db321e1
  Stored in directory: /home/jovyan/.cache/pip/wheels/d7/a9/33/acc7b709e2a35caa7d4cae442f6fe6fbf2c43f80823d46460c
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.4.1 retrying-1.3.3
Collecting vaderSentiment
  Downloading https://files.pythonhosted.org/packages/86/9e/c53e1fc61aac5ee490a6ac5e21b1ac04e55a7c2aba647bb8411c9aadf24e/vaderSentiment-3.2.1-py2.py3-none-any.whl (125kB)
     |████████████████████████████████| 133kB 3.2MB/s eta 0:00:01
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.2.1
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Let's Get the Data!

Below we will outline our process for retrieving the necessary information from Reddit. We begin by connecting to a Python wrapper for the Pushshift API that allows us to access Reddit data. We define a couple of functions to help us simplify using sqlite3 later on, and then we outline our SQL statements for creating our various tables and execute those statements.

In [ ]:
# Connecting to the API
r = praw.Reddit(client_id=*******,
                client_secret=*******,
               user_agent=*******)
api = PushshiftAPI(r)
In [7]:
# create a database connection to the SQLite database specified by db_file
def create_connection(db_file):
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
 
    return conn

# create a table from the create_table_sql statement
def create_table(conn, create_table_sql):
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)
In [ ]:
# variables for commands for creating SQL tables

sql_create_user_table = """ CREATE TABLE IF NOT EXISTS User (
                                    name text PRIMARY KEY,
                                    flair text,
                                    created_utc float NOT NULL
                                ); """

sql_create_user_subreddits_table = """ CREATE TABLE IF NOT EXISTS UserSubreddits (
                                    name text,
                                    subreddit text,
                                    FOREIGN KEY (name) references User (name)
                                ); """

sql_create_post_table = """CREATE TABLE IF NOT EXISTS Post (
                                id text PRIMARY KEY,
                                name text NOT NULL,
                                url text,
                                title text,
                                selftext text,
                                score integer NOT NULL,
                                created_utc float NOT NULL,
                                permalink text,
                                link_flair_text text,
                                FOREIGN KEY (name) REFERENCES User (name)
                            );"""

sql_create_comment_table = """CREATE TABLE IF NOT EXISTS Comment (
                                id text PRIMARY KEY,
                                name text NOT NULL,
                                body text,
                                score integer NOT NULL,
                                parent_id text NOT NULL,
                                link_id text NOT NULL,
                                created_utc float NOT NULL,
                                FOREIGN KEY (name) REFERENCES User (name),
                                FOREIGN KEY (parent_id) REFERENCES Comment (id),
                                FOREIGN KEY (link_id) REFERENCES Post (id)
                            );"""

# create a database connection
conn = create_connection("./R_UMD.db")

# create tables
if conn is not None:
    create_table(conn, sql_create_user_table)
    create_table(conn, sql_create_user_subreddits_table)
    create_table(conn, sql_create_post_table)
    create_table(conn, sql_create_comment_table)
    conn.close()
else:
    print("Error! cannot create the database connection.")  

Scraping Begins

Our scraping will take place in 3 phases:

  1. Getting all submissions/posts
  2. Getting all comments and user flairs from those comments
  3. Getting the list of subreddits that all users of r/UMD also participate in

For (1), we will make a request for all submissions on r/UMD after Jan 1, 2010 (before r/UMD existed) and store them in the database. We will also keep track of the users and their information along the way.

In [ ]:
# create tables
if conn is not None:
    # r/UMD was born April of 2010, get all posts from then on
    start_epoch = int(dt.datetime(2010, 1, 1).timestamp())
    
    # actual request to API
    # we are first looking for 'submissions', i.e. posts/top-level comments
    results = list(api.search_submissions(after=start_epoch,
                                          subreddit='UMD',
                                          filter=['url','author', 'title', 'subreddit'],
                                          limit=None))

    # for each result put appropriate information in the appropriate table
    for res in results :
        # we will first add the user if they aren't already in there
        # we will deal will the user flairs later
        user_task = (str(res.author), res.created_utc) 
        user_sql = ''' INSERT or IGNORE INTO User(name,created_utc)
              VALUES(?,?) '''
        
        # we will then deal with adding the information from the post to the post table
        post_task = (res.id, str(res.author), res.url, res.title, str(res.selftext), res.score,
                     res.created_utc, str(res.permalink), str(res.link_flair_text))
        post_sql = ''' INSERT or IGNORE INTO Post(id,name,url,title,selftext,score,created_utc,permalink,link_flair_text)
              VALUES(?,?,?,?,?,?,?,?,?) '''
        
        # try executing SQL statements above
        cur = conn.cursor()
        try :
            cur.execute(user_sql, user_task)
            cur.execute(post_sql, post_task)
        except :
            cur.close()
            conn.close()
            
        # commit additions to the DB
        conn.commit()
     
    # close connection for now
    conn.close()
else:
    print("Error! cannot create the database connection.")

Phase 2

Next, we need to get a table of all comments from r/UMD. To do that, we first read all of our submissions into a Pandas DataFrame to make accesses quicker, and then get the comments for each of the submissions we just scraped. This one takes a couple of hours, so go check out r/UMD and read some for yourself!

In [ ]:
# create a database connection and make a dataframe so we can access the submissions quicker
conn = create_connection("./R_UMD.db")
df = pd.read_sql("SELECT * FROM Post", conn)

# create tables
if conn is not None :
    # we need to loop through all of the submissions that we just collected in order to get their comments
    for i, row in df.iterrows() :
        # actual call to the API, we can get the submission as an object and read its subseequent comments as a list
        sub = r.submission(id=row['id'])
        comment_list = sub.comments.list()
        # we will add EVERY comment into the database, along with user information (new users) and flair data if available
        for comment in comment_list :
            # add the comment to the comment tables
            comment_task = (str(comment.id), str(comment.author), comment.body, comment.score,
                            comment.parent_id, comment.link_id, comment.created_utc) 
            comment_sql = ''' INSERT or IGNORE INTO Comment(id,name,body,score,parent_id,link_id,created_utc)
                 VALUES(?,?,?,?,?,?,?) '''

            # we will need to add the user if they are not already in the user table
            user_task = (str(comment.author), comment.author.created_utc)
            user_sql = ''' INSERT or IGNORE INTO User(name,created_utc)
                  VALUES(?,?) '''

            # if we can get a flair from a user's comment, we will update the user table to have the flair for that user
            flair_task = (str(comment.author_flair_text), str(comment.author))
            flair_sql = ''' UPDATE User SET flair=(?) WHERE name=(?)'''

            # try executing SQL statements above
            cur = conn.cursor()
            try :
                ur.execute(comment_sql, comment_task)
                cur.execute(user_sql, user_task)
                cur.execute(flair_sql, flair_task)
            except Exception as e:
                print(e)

            # commit additions to the DB
            conn.commit()
            
    # close connection for now
    conn.close()
else:
    print("Error! cannot create the database connection.")

Phase 3

The last step in the scraping process is to grab a list of all the subreddits that any user on r/UMD have ever commented in. We want to analyze later on what other subreddits that r/UMD users are interested in. We will do this again by reading our current SQL user table into a Pandas DataFrame and then grabbing the subreddit from each comment for a user, for all users.

In [ ]:
# create a database connection and make a dataframe for quicker operations, again
conn = create_connection("./R_UMD.db")
df = pd.read_sql("SELECT * FROM User", conn)

# create tables
if conn is not None :
    # we need to get each user and look at their individual subreddit history
    for i, row in df.iterrows() :
        # actual call to the API, get a 'User' object so we can see all their comments
        red = r.redditor(row['name'])
            # we will loop through all of the user's comments and add the subreddit they commented in to the database
            for x in red.comments.new(limit=None) :
                # we will simply add the username and subreddit to the table
                subreddit_task = (row['name'], str(x.subreddit)) 
                subreddit_sql = ''' INSERT or IGNORE INTO UserSubreddits(name,subreddit)
                      VALUES(?,?) '''

                # try executing SQL statement above
                cur = conn.cursor()
                try :
                    cur.execute(subreddit_sql, subreddit_task)
                except :
                    cur.close()
                    conn.close()

        conn.commit()

    conn.close()
else:
    print("Error! cannot create the database connection.")

Preparing the Data for Analysis

Our data is pretty tidy after scraping from Reddit (we will do some cleaning of text later on), so now we will read it one last time into four separate Pandas DataFrames. We also want to get rid of any bots that may have posted in r/UMD. To do this we will simply remove any data associated with users whose usernames end in "bot" (most honest bots end with this string).

In [8]:
# create the connection
conn = create_connection("./R_UMD.db")

# make dataframes from each table in the SQLite database
df_user = pd.read_sql("SELECT * FROM User", conn)
df_user_sub = pd.read_sql("SELECT * FROM UserSubreddits", conn)
df_post = pd.read_sql("SELECT * FROM Post", conn)
df_comment = pd.read_sql("SELECT * FROM Comment", conn)

# close connection -- no longer needed
conn.close()

# try and find all bots (usernames end with 'bot')
df_bot = df_user_sub[df_user_sub['name'].str.endswith("bot")]
gb = df_bot.groupby('name')    
gb = [gb.get_group(x) for x in gb.groups]

# remove any row from df_user_sub if it's from a bot
for name in gb :
    df_user_sub = df_user_sub[df_user_sub.name != str(name['name'].reset_index(drop=True)[0])]
    df_user = df_user[df_user.name != str(name['name'].reset_index(drop=True)[0])]
    df_post = df_post[df_post.name != str(name['name'].reset_index(drop=True)[0])]
    df_comment = df_comment[df_comment.name != str(name['name'].reset_index(drop=True)[0])]

Are users exclusive to r/UMD?

Reddit has become an increasingly popular form of spreading news around campus or promoting a club or event. Because of this, we wanted to investigate how many users are active Redditors and which ones have accounts just to post on r/UMD. As it turns out, about 50% (~9,000) of the users who have ever made a post on r/UMD post on r/UMD infrequently (<25% of their posts). On the other hand, about 25% (~4,000) of r/UMD users post exclusively on r/UMD (95% of their posts or higher).

In [4]:
# group the dataframe by user
gb = df_user_sub.groupby('name')    
gb = [gb.get_group(x) for x in gb.groups]

l = list()
names = list()

# for each users group of submissions, find out what percentage of them are in r/UMD
for name in gb :
    try :
        l.append([str(name['name'].reset_index(drop=True)[0]),
                  (name[name.subreddit == 'UMD']['subreddit'].value_counts()/name['subreddit'].count())[0]])
    except :
        # User never posted in r/UMD, only commented
        pass

# plot the result as a histogram
fig = px.histogram(x=[row[1]*100 for row in l], nbins=5, title="Distribution of Users' Percentage of Posts on r/UMD",
                   labels=dict(x="Percentage of Posts in r/UMD"))
fig.update_xaxes(range=[0, 100])
fig.update_layout(yaxis_title="Number of Users")
fig.show()

So where are they posting if not on r/UMD?

It is clear that despite a large portion of users only posting on r/UMD, there is still a large percentage of users that are active in other subreddits. Below are the top 30 alternative subreddits that r/UMD users post in. They are split up into two groups:

  1. Default subreddits
  2. Other subreddits

Default subreddits are subreddits that a user is automatically subscribed to when they first make an account on Reddit. r/UMD users have started to branch away from these subreddits as 20 of the top 30 subreddits they post in are non-default.

In [5]:
# dataframe that has posts not in r/UMD
df_non_umd = df_user_sub[df_user_sub.subreddit != 'UMD']
top_subs = (df_non_umd['subreddit'].value_counts()/df_non_umd['subreddit'].count())[:30]*100
subs = top_subs.index.tolist()
vals = top_subs.tolist()

# new figure for graph
fig = go.Figure()

# list of defaults subreddits that users are subscribed to.
defaults = ['AskReddit','funny','pics','todayilearned','gaming','videos','IAmA','worldnews','news','aww','gifs','movies',
'mildlyinteresting','Showerthoughts','Music','science','explainlikeimfive','LifeProTips','personalfinance']

c_def = 0
c_oth = 0

for i, sub in enumerate(subs) :
    # add a new bar to the graph
    fig.add_trace(go.Bar(
        x=[sub],
        y=[vals[i]],
        name='Default Subreddits' if sub in defaults else 'Other',
        marker_color='lightsalmon' if sub in defaults else 'blue',
        showlegend=True if c_def == 0 and sub in defaults or c_oth == 0 and sub not in defaults else False,
        legendgroup='lightsalmon' if sub in defaults else 'blue'
    ))
    c_def += 1 if sub in defaults else 0
    c_oth += 1 if sub not in defaults else 0

# plot the graph
fig.update_layout(yaxis_title="Percentage of All Users' Posts", xaxis_title="Subreddit", title="Most Popular Alternative Subreddits")
fig.show()

Do the users match the demographic of the University?

According to The Office of Institutional Research Planning and Assessment, the top 5 undergraduate degrees are all in 'STEM' majors. We wanted to see if this popularity trend persisted in the r/UMD user base. About 10% of the users on r/UMD have 'flairs', i.e. banners next to their name that typically describe what major they are. See this for more detail. Using the users who have flairs as a representative sample, we can make a prediction on what majors the rest of the r/UMD users are.

In [6]:
# make a new dataframe that only has 
df_user_train = df_user.replace(to_replace='None', value=np.nan).replace(to_replace='', value=np.nan).dropna()
df_user_train['flair_clean'] = "unknown"

# function to determine if a user is STEM or NON-STEM
# returns the type of major as a string, 'unknown' if cannot be determined
def flair_clean(flair) :
    flair = ''.join(i for i in flair if not i.isdigit())
    flair = flair.lower()
    
    # list of prefixes/infixes that denote a STEM major
    stem = {"cs","computer science","comp sci","cmsc","kruskal","cmns","compsci","info sci","ischool","infosci", \
            "bchm","bio","chem","compe","ce","computer","compeng","comp","ee","aero","enae","enme","mech","meng", \
           "math","markov","phys","it support","phnb","aosc","gis","bsci","info","chbe","fire","inst","ae", \
           "network","premed","fpe","stem","ensp","enst","astro"," is", "civ","comsci","ents","mse", "stack", \
           "eng","stat","amsc","numer","web","matsci","cmps","psci","cbmg","cpe","astr","me ","mate","enfp", \
           "anatomy","bis","soft"}
    # list of prefixes/infixes that denote a NON-STEM major
    non_stem = {"econ","comm","journal","gvpt","government","policy","gov","criminal","ccjs","crim","bmgt", \
               "manage","market","business","kinesi","knes","psyc","ecology","sociology","socy","design","anthro", \
               "film","english","engl","arch","larc","philosophy","arhu","women","arec","anth","creative", "history", \
               "ansc","hist","plcy","amer","account","jour","geog","supply","art","geol","theatre","scm","agnr", \
               "music","social","lang","hort","public","ling","elem","arabic","hcim","nfsc","jap","fmsc","mph", \
               "ath","jd","fin","russian","germ","fam","agro","enology"}
    for major in stem :
        if (major in flair) :
            return "STEM"
    for major in non_stem :
        if (major in flair) :
            return "NON STEM"
    return "unknown"


for i, row in df_user_train.iterrows() :
    new = flair_clean(str(row['flair']))
    df_user_train.at[i, 'flair_clean'] = new

After cleaning the user flair data and determining the percentage of STEM and NON STEM majors in the sample, we can then find the confidence interval for a single proportion. Looking at the sample, we are 95% confident that the true proportion of STEM majors on r/UMD is between 77.8% and 81.93%. Finding complete data for the University is hard, although the IRPA report indicates that the true percentage of STEM majors for the university may lie closer to 50% - 60%. We also must be wary of our results, as our sample was not truly random due to some data missing at random.

In [7]:
# get the number of STEM and NON STEM majors
num_stem = len(df_user_train[df_user_train.flair_clean == "STEM"])
num_non_stem = len(df_user_train[df_user_train.flair_clean == "NON STEM"])

# calculate the 95% confidence interval
lower, upper = smp.proportion_confint(num_stem, num_stem + num_non_stem, alpha=0.05, method='normal')
error = (upper - lower)
avg = (upper + lower) / 2

# labels for the plot
labels = ['STEM Majors','Margin of Error','NON-STEM Majors']
values = [lower, error, 1-upper]

# plot the pie graph
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Estimated Percentage of STEM Majors and NON-STEM Majors (95% Confidence)')
fig.show()

Time-series Regression Analysis

Let's start by plotting histograms of post activity and comment activity over time.

In [8]:
fig = px.histogram(
    x=[datetime.utcfromtimestamp(s) for s in df_post["created_utc"]],
    title='Posts over Time'
)
fig.show()

fig2 = px.histogram(
    x=[datetime.utcfromtimestamp(s) for s in df_comment["created_utc"]],
    title='Comments over Time'
)
fig2.show()

Pretty cool! You can definitely see the general upward trend of activity, as well as seasonal spikes. Hovering over the months, you can see that the dips are typically June, July, August, as well as January and February. This makes sense, as these are breaks in which students are not engaged with UMD on a day to day basis.

So what if we want to put numbers to these trends? For example, how many fewer posts per month are there during breaks, and approximately how many more comments per month are there per year? In order to do this, we should fit a regression, predicting activity (measured by posts/comments per month) based on year and season.

The first thing we need to do is to restructure the data so that we have the posts/comments per month available.

In [9]:
# Don't mess up the rest of the code
df_commentActivity = df_comment.copy()
df_postActivity = df_post.copy()

# Currently, we only have timestamps. Create two new columns by converting those timestamps into dates,
# and retrieving the relevant information from those dates.
df_commentActivity["month"] = [datetime.utcfromtimestamp(s).strftime("%b") for s in df_comment["created_utc"]]
df_commentActivity["year"] = [datetime.utcfromtimestamp(s).year for s in df_comment["created_utc"]]
df_postActivity["month"] = [datetime.utcfromtimestamp(s).strftime("%b") for s in df_post["created_utc"]]
df_postActivity["year"] = [datetime.utcfromtimestamp(s).year for s in df_post["created_utc"]]

# By grouping by month and year, we can get a count for every month (e.g. Nov 2016, Dec 2016, Jan 2017, etc.)
df_commentActivity = \
    df_commentActivity.groupby(["year", "month"], as_index=False).size().reset_index().rename(columns={0: "count"})
df_postActivity = \
    df_postActivity.groupby(["year", "month"], as_index=False).size().reset_index().rename(columns={0: "count"})

# One might think that a good categorization of the months would be by season - Winter, Spring, etc.
# However, looking at the seasonal trends on the histogram, there doesn't seem to be a big distinction between
# fall and spring semesters, nor winter and summer breaks. Thus, we can split Break/Semester instead of season.
seasonLookup = {
    "Jan": "Break",
    "Feb": "Semester",
    "Mar": "Semester",
    "Apr": "Semester",
    "May": "Semester",
    "Jun": "Break",
    "Jul": "Break",
    "Aug": "Break",
    "Sep": "Semester",
    "Oct": "Semester",
    "Nov": "Semester",
    "Dec": "Break"
}

# Create two new columns based on our lookup.
df_commentActivity["season"] = [seasonLookup[m] for m in df_commentActivity["month"]]
df_postActivity["season"] = [seasonLookup[m] for m in df_postActivity["month"]]

Now that the data is structured, we can try to fit a regression. Regression is the use of one or more x variables to predict a y variable, by using a line of best fit. A gentle introduction to regression can be found here using only one x predictor.

Initially, we have one response (count), and two predictors (year, and season). Let's try to fit it now.

In [10]:
modelComment = smf.ols(formula='count ~ year+season', data=df_commentActivity).fit()
# Regression fit
print(modelComment.summary())

# Graphs to check assumptions of regression
px.violin(x=modelComment.fittedvalues, y=modelComment.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment.resid))], y=modelComment.resid, title='Residuals vs. Index')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  count   R-squared:                       0.742
Model:                            OLS   Adj. R-squared:                  0.737
Method:                 Least Squares   F-statistic:                     159.4
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           2.32e-33
Time:                        04:09:22   Log-Likelihood:                -939.82
No. Observations:                 114   AIC:                             1886.
Df Residuals:                     111   BIC:                             1894.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept          -1.123e+06    6.4e+04    -17.551      0.000   -1.25e+06   -9.96e+05
season[T.Semester]   404.1425    177.209      2.281      0.024      52.992     755.294
year                 558.1162     31.751     17.578      0.000     495.200     621.032
==============================================================================
Omnibus:                        4.089   Durbin-Watson:                   0.702
Prob(Omnibus):                  0.129   Jarque-Bera (JB):                3.965
Skew:                           0.455   Prob(JB):                        0.138
Kurtosis:                       2.928   Cond. No.                     1.47e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.47e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

The five assumptions

Looking at just the p-values, it looks good. All of them are significantly below the 5% threshold. However, regression only works under five assumptions:

  1. Linearity
  2. Independence
  3. Normality
  4. Homoscedasticity
  5. Non-collinearity

The violin graph is used to assess linearity, and homoscedasticity. For linearity, we are checking that each violin is centered around 0, and for homoscedasticity, we are checking that each violin is "roughly" the same size. Both of these clearly are violated due to many of the violins not being centered at 0, and wild variation in violin lengths.

The histogram is used to assess normality, and we are checking that the histogram follows a bell-curve. There is a right skew, and the curve is a little flat to be considered normal. Thus, normality is also violated.

The last scatter plots index vs. residual, used to check independence, and we are checking if there is a pattern with regards to the obesrvations. There is a clear pattern: a downward trend until about the 80th observation.

Statsmodels has also given us a warning that multicollinearity may be an issue.

Oh no! We have scored a 0/5 in meeting the assumptions. In order for our model to give us usable numbers, we must attempt to meet all of these assumptions.

The following steps will identify each of the assumptions, explain in simple terms why it is neccessary to uphold that assumption before performing analysis, and suggest a correction to meet this assumption for the given data set of r/UMD data. Further reading on other corrections for other data sets can be found here, provided by Duke University.

Assumption 1: Linearity

The first assumption we should tackle is linearity. Linearity means that there is roughly a linear relationship between the predictors and the response. Because it was violated in our initial model, it seems that we may have a non-linear relationship between one of our predictors and count. This could make sense, it seems that as the years increase, the activity increases at a non-linear rate. We will try introducing a year2 variable to our dataset to try and capture this non-linear relationship.

In [11]:
# The square function
square = lambda x: x**2

# add the year squared
modelComment = smf.ols(formula='count ~ year+square(year)+season', data=df_commentActivity).fit()
print(modelComment.summary())

px.violin(x=modelComment.fittedvalues, y=modelComment.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment.resid))], y=modelComment.resid, title='Residuals vs. Index')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  count   R-squared:                       0.821
Model:                            OLS   Adj. R-squared:                  0.816
Method:                 Least Squares   F-statistic:                     168.2
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           6.10e-41
Time:                        04:09:25   Log-Likelihood:                -918.93
No. Observations:                 114   AIC:                             1846.
Df Residuals:                     110   BIC:                             1857.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept           3.007e+08   4.32e+07      6.954      0.000    2.15e+08    3.86e+08
season[T.Semester]   426.4218    148.232      2.877      0.005     132.661     720.182
year                -2.99e+05   4.29e+04     -6.966      0.000   -3.84e+05   -2.14e+05
square(year)          74.3580     10.654      6.979      0.000      53.245      95.471
==============================================================================
Omnibus:                       13.137   Durbin-Watson:                   1.021
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               20.542
Skew:                           0.532   Prob(JB):                     3.46e-05
Kurtosis:                       4.786   Cond. No.                     2.40e+12
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.4e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

That seemed to really help! The violins are more centered around 0, the histogram looks more like a bell curve, and we've eliminated the trend in the independence graph. However, the violins on the right are much bigger than the violins on the left—we will attempt to tackle this next.

Assumption 2: Homoscedasticity

Homoscedasticity is a fancy word meaning "equal spread." When we check for this assumption, we are making sure that the variance in the residuals is roughly the same in all places. The residuals are errors: if the errors get increasingly larger as the predicted values get larger, then our model will have trouble accurately predicting for large y-values. Again, this is due to the non-linear trend in the data, as we start with comments in the tens and hundreds and grow to thousands. Using a log transformation on the count variable will lessen the impact of how quickly the subreddit grew, and hopefully ensure the violins all are the same size.

In [12]:
square = lambda x: x**2

# transform count into log(count)
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()
print(modelComment.summary())

px.violin(x=modelComment.fittedvalues, y=modelComment.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment.resid))], y=modelComment.resid, title='Residuals vs. Index')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(count)   R-squared:                       0.790
Model:                            OLS   Adj. R-squared:                  0.784
Method:                 Least Squares   F-statistic:                     138.0
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           3.85e-37
Time:                        04:09:27   Log-Likelihood:                -92.059
No. Observations:                 114   AIC:                             192.1
Df Residuals:                     110   BIC:                             203.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept          -1.145e+05   3.06e+04     -3.741      0.000   -1.75e+05   -5.38e+04
season[T.Semester]     0.3643      0.105      3.472      0.001       0.156       0.572
year                 113.3200     30.387      3.729      0.000      53.099     173.541
square(year)          -0.0280      0.008     -3.717      0.000      -0.043      -0.013
==============================================================================
Omnibus:                       59.243   Durbin-Watson:                   1.122
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              371.777
Skew:                          -1.582   Prob(JB):                     1.86e-81
Kurtosis:                      11.262   Cond. No.                     2.40e+12
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.4e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

The transformation again did wonders for all three graphs: the violins (bar one) are now roughly the same size, our histogram (bar the skew) is normal, and the independence graph is looking better as well. In all three cases, there is one culprit: outliers.

Assumption 3: Normality

The assumption of normality checks that the residuals (not the data!) follow a normal distribution. If the residuals are non-normal, such as right now, there will be a skew when estimating p-values. Our curve looks pretty decent, except for a few outliers giving it a right skew. In order to fix this, we will investigate the outliers and possibly remove them.

In [13]:
square = lambda x: x**2

# previous model
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()

# test for outliers
test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
# investigate outliers
print([df_commentActivity.iloc[i] for i in outliers])
# drop outliers
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)

# Use model without outliers
modelComment2 = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivityNoOutliers).fit()
print(modelComment2.summary())

px.violin(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
[year       2010
month       Aug
count        19
season    Break
Name: 0, dtype: object, year       2010
month       Jun
count         6
season    Break
Name: 3, dtype: object]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(count)   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.832
Method:                 Least Squares   F-statistic:                     184.6
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           2.37e-42
Time:                        04:09:32   Log-Likelihood:                -57.828
No. Observations:                 112   AIC:                             123.7
Df Residuals:                     108   BIC:                             134.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept          -5.367e+04   2.38e+04     -2.253      0.026   -1.01e+05   -6457.410
season[T.Semester]     0.2550      0.079      3.213      0.002       0.098       0.412
year                  52.9488     23.643      2.239      0.027       6.084      99.814
square(year)          -0.0131      0.006     -2.225      0.028      -0.025      -0.001
==============================================================================
Omnibus:                        1.291   Durbin-Watson:                   0.967
Prob(Omnibus):                  0.524   Jarque-Bera (JB):                0.804
Skew:                          -0.126   Prob(JB):                        0.669
Kurtosis:                       3.330   Cond. No.                     2.48e+12
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.48e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

Investigating the outliers, it seems that they are from the inception of the subreddit. Because they differ from the other points in terms of their x-distance and their y-distance, they significantly affect the model and should be taken out. Because we have over 100 other observations, we should be fine.

After successfully removing the outliers, the graphs are centered better. Now, we will tackle independence.

Assumption 4: Independence

At first, the violin plot looks fine: all of the violins' centers fall between -1 and 1, and the spread for the most part is roughly the same throughout. However, looking at the centers of the violins, there is a certain up-and-down curvature, especially at the beginning. Looking at the scatter plot, this same curvature exists. There is still a slight pattern in our scatter plot, meaning the residuals are not independent with respect to time. In order to capture this relationship, we can add a lagged variable, meaning we can use the count of the previous month to predict the current month.

In [14]:
square = lambda x: x**2

modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()

test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)

df_commentActivityNoOutliersLagged = df_commentActivityNoOutliers.copy()
# Create lagged variable
df_commentActivityNoOutliersLagged["countLag"] = df_commentActivityNoOutliersLagged["count"].shift(-1)

modelComment2 = smf.ols(formula='np.log(count) ~ year+square(year)+season+np.log(countLag)', 
                       data=df_commentActivityNoOutliersLagged).fit()
print(modelComment2.summary())

px.scatter(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(count)   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     182.6
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           1.35e-46
Time:                        04:09:40   Log-Likelihood:                -42.697
No. Observations:                 111   AIC:                             95.39
Df Residuals:                     106   BIC:                             108.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept          -3.343e+04   2.16e+04     -1.550      0.124   -7.62e+04    9319.265
season[T.Semester]     0.1074      0.075      1.441      0.153      -0.040       0.255
year                  33.0063     21.398      1.542      0.126      -9.418      75.431
square(year)          -0.0081      0.005     -1.534      0.128      -0.019       0.002
np.log(countLag)       0.4751      0.083      5.755      0.000       0.311       0.639
==============================================================================
Omnibus:                       12.505   Durbin-Watson:                   1.882
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               17.381
Skew:                          -0.566   Prob(JB):                     0.000168
Kurtosis:                       4.574   Cond. No.                     2.53e+12
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.53e+12. This might indicate that there are
strong multicollinearity or other numerical problems.

Unfortunately, the violin plots are too fine here to be displayed, so we've used a scatter plot in place of the violin. The interpretation is the same: if we imagine splitting the dots horizontally into chunks, then each chunk should have a mean of zero and a spread that is equal throughout. Save for a few minor outliers, the assumptions have almost been satsified. Now, let's get rid of that collinearity error.

Assumption 5: Non-collinearity

This one is rather simple: year and square(year) have a strong collinearity because the latter is based on the former. We can center these variables by subtracting by their means. This will produce an equivalent model.

In [15]:
square = lambda x: x**2
# centering function
center = lambda x: x - x.mean()

modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()

test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)

df_commentActivityNoOutliersLagged = df_commentActivityNoOutliers.copy()
df_commentActivityNoOutliersLagged["countLag"] = df_commentActivityNoOutliersLagged["count"].shift(-1)

modelComment2 = smf.ols(formula='np.log(count) ~ center(year)+square(center(year))+season+np.log(countLag)', 
                       data=df_commentActivityNoOutliersLagged).fit()
print(modelComment2.summary())

px.scatter(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(count)   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     182.6
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           1.35e-46
Time:                        04:09:49   Log-Likelihood:                -42.697
No. Observations:                 111   AIC:                             95.39
Df Residuals:                     106   BIC:                             108.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                3.7745      0.594      6.350      0.000       2.596       4.953
season[T.Semester]       0.1074      0.075      1.441      0.153      -0.040       0.255
center(year)             0.1783      0.030      5.999      0.000       0.119       0.237
square(center(year))    -0.0081      0.005     -1.534      0.128      -0.019       0.002
np.log(countLag)         0.4751      0.083      5.755      0.000       0.311       0.639
==============================================================================
Omnibus:                       12.505   Durbin-Watson:                   1.882
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               17.381
Skew:                          -0.566   Prob(JB):                     0.000168
Kurtosis:                       4.574   Cond. No.                         200.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Ta-da! Our model should be ready for interpretation. However, along the way, our season and year2 variables have become statistically insignificant. Nevertheless, season still definitely has an effect on the activity per month. The issue is probably that we have defined season too widely—perhaps it will become statistically significant if we narrow it to summer break only.

In [16]:
# Redefine "season" to be only summer break
df_commentActivity["summer"] = [True if m in ["Jun", "Jul", "Aug"] else False for m in df_commentActivity["month"]]

square = lambda x: x**2
center = lambda x: x - x.mean()

modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+summer', data=df_commentActivity).fit()

test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)

df_commentActivityNoOutliersLagged = df_commentActivityNoOutliers.copy()
df_commentActivityNoOutliersLagged["countLag"] = df_commentActivityNoOutliersLagged["count"].shift(-1)

# Remove year^2 and fit summer instead of season
modelComment2 = smf.ols(formula='np.log(count) ~ center(year)+summer+np.log(countLag)', 
                       data=df_commentActivityNoOutliersLagged).fit()
print(modelComment2.summary())

px.scatter(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(count)   R-squared:                       0.878
Model:                            OLS   Adj. R-squared:                  0.875
Method:                 Least Squares   F-statistic:                     257.9
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           8.26e-49
Time:                        04:09:54   Log-Likelihood:                -40.354
No. Observations:                 111   AIC:                             88.71
Df Residuals:                     107   BIC:                             99.55
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            3.7294      0.557      6.691      0.000       2.624       4.834
summer[T.True]      -0.2366      0.079     -2.999      0.003      -0.393      -0.080
center(year)         0.1754      0.028      6.301      0.000       0.120       0.231
np.log(countLag)     0.4900      0.076      6.442      0.000       0.339       0.641
==============================================================================
Omnibus:                       15.979   Durbin-Watson:                   1.951
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.161
Skew:                          -0.678   Prob(JB):                     5.67e-06
Kurtosis:                       4.839   Cond. No.                         124.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Voila! There is our final model, with all assumptions satisfied and all variables signficant. We can now interpret the coefficients:

  • During summer break, the number of comments on average will decrease by (1 - e-0.2366)%, or 21%, given all other variables are constant.
  • With an increase of one year, the number of comments on average will increase by (e0.1754 - 1)%, or 19%, given all other variables are constant.
  • On average, every 1% increase in the previous months' comments leads to a (e0.4904 - 1)%, or 0.63% increase in the current months' comments, given all other variables are constant.
In [17]:
df_postActivity["summer"] = [True if m in ["Jun", "Jul", "Aug"] else False for m in df_postActivity["month"]]

square = lambda x: x**2
center = lambda x: x - x.mean()

modelPost = smf.ols(formula='np.log(count) ~ year+square(year)+summer', data=df_postActivity).fit()

test = modelPost.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_postActivityNoOutliers = df_postActivity.drop(outliers)

df_postActivityNoOutliersLagged = df_postActivityNoOutliers.copy()
df_postActivityNoOutliersLagged["countLag"] = df_postActivityNoOutliersLagged["count"].shift(-1)

# Remove year^2 and fit summer instead of season
modelPost2 = smf.ols(formula='np.log(count) ~ center(year)+summer+np.log(countLag)', 
                       data=df_postActivityNoOutliersLagged).fit()
print(modelPost2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(count)   R-squared:                       0.909
Model:                            OLS   Adj. R-squared:                  0.907
Method:                 Least Squares   F-statistic:                     357.5
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           1.36e-55
Time:                        04:10:00   Log-Likelihood:                -34.085
No. Observations:                 111   AIC:                             76.17
Df Residuals:                     107   BIC:                             87.01
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            2.8961      0.398      7.282      0.000       2.108       3.684
summer[T.True]      -0.3129      0.075     -4.158      0.000      -0.462      -0.164
center(year)         0.1988      0.028      6.999      0.000       0.142       0.255
np.log(countLag)     0.4823      0.071      6.789      0.000       0.341       0.623
==============================================================================
Omnibus:                       38.673   Durbin-Watson:                   1.881
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               99.349
Skew:                          -1.293   Prob(JB):                     2.67e-22
Kurtosis:                       6.846   Cond. No.                         72.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can easily do the same for posts:

  • During summer break, the number of posts on average will decrease by (1 - e-0.3125)%, or 27%, given all other variables are constant.
  • With an increase of one year, the number of posts on average will increase by (e0.1988 - 1)%, or 22%, given all other variables are constant.
  • On average, every 1% increase in the previous months' posts leads to a (e0.4823 - 1)%, or 0.62% increase in the current months' posts, given all other variables are constant.

Time-series analysis: how long do users stay on the subreddit?

Exploring the idea of this data as a time-series, we can maybe investigate a few more trends. Using SQL, we can select group every user's posts, and find their first and last posts and comments. By combining this information, we can see a user's first and last activity (post or comment) on this subreddit.

In [18]:
conn = create_connection("./R_UMD.db")

df_startend = pd.read_sql(
    """SELECT name, 
    MIN(mn) as firstUTC, MAX(mx) as lastUTC, 
    MAX(mx) - MIN(mn) as durationUTC,
    DATETIME(MIN(mn), 'unixepoch') as firstDate, 
    DATETIME(MAX(mx), 'unixepoch') as lastDate, 
    JulianDay(DATETIME(MAX(mx), 'unixepoch')) - JulianDay(DATETIME(MIN(mn), 'unixepoch')) as durationDays FROM (
        SELECT name, MIN(created_utc) as mn, MAX(created_utc) as mx FROM Post GROUP BY name
        UNION
        SELECT name, MIN(created_utc) as mn, MAX(created_utc) as mx FROM Comment GROUP BY name
    ) t1 GROUP BY name ORDER BY durationDays DESC""", 
    conn
)

conn.close()

df_startend.head()
Out[18]:
name firstUTC lastUTC durationUTC firstDate lastDate durationDays
0 None 1.277515e+09 1.574025e+09 296510487.0 2010-06-26 01:13:33 2019-11-17 21:15:00 3431.834340
1 chrisg90 1.277440e+09 1.568899e+09 291459074.0 2010-06-25 04:21:00 2019-09-19 13:12:14 3373.368912
2 Blue_5ive 1.279951e+09 1.570563e+09 290611801.0 2010-07-24 05:58:33 2019-10-08 19:28:34 3363.562512
3 Ares__ 1.279777e+09 1.569695e+09 289917731.0 2010-07-22 05:44:03 2019-09-28 18:26:14 3355.529294
4 umd_charlzz 1.286715e+09 1.573569e+09 286854188.0 2010-10-10 12:45:30 2019-11-12 14:28:38 3320.071620

Using this structured data, we can start to plot the first and last activity for every user.

In [19]:
fig = px.histogram(
    x=[datetime.utcfromtimestamp(s).month for s in df_startend["firstUTC"]],
    title='First activity by month'
).show()

fig = px.histogram(
    x=[datetime.utcfromtimestamp(s).month for s in df_startend["lastUTC"]],
    title='Last activity by month'
).show()

This reveals a couple interesting trends. For both graphs, the spring semester is very similar to the fall semester, and thus, it seems that the cycle of activity is semester-by-semester, rather than year-by-year. Also, the shape of the semesters are different for first activity and last activity. The first activity seems to peak in the middle of the semester, perhaps suggesting that users slowly learn about Reddit, peaking during the middle of the semester. The last activity slowly increases as time goes on. This makes sense, as users are most likely to stop interacting with the subreddit as they graduate or go on break.

Let's see if we can get some statistics for how long users stay on the subreddit.

In [20]:
px.histogram(
    x=[timedelta(seconds=s).days for s in df_startend["durationUTC"]],
    title="Days actively participating in subreddit"
).show()

print("Mean duration: {}".format(timedelta(seconds=df_startend["durationUTC"].mean())))
print("Median duration: {}".format(timedelta(seconds=df_startend["durationUTC"].median())))
print("Standard deviation of duration: {}\n".format(timedelta(seconds=df_startend["durationUTC"].std())))

print("Users with a single activity: {}".format(df_startend["durationUTC"][df_startend["durationUTC"] == 0].count()))
print("Percentage of users with a single activity: {0:.2f}%".format( 
      df_startend["durationUTC"][df_startend["durationUTC"] == 0].count() * 100 / 
      df_startend["durationUTC"].count()))
Mean duration: 242 days, 16:47:35.999267
Median duration: 1 day, 13:24:33
Standard deviation of duration: 463 days, 8:59:34.014975

Users with a single activity: 7059
Percentage of users with a single activity: 36.94%

With over a third of the users only participating once, it's hard to see the bigger picture. Let's take them out and see the data again.

In [21]:
px.histogram(
    x=[timedelta(seconds=s).days for s in df_startend[df_startend["durationUTC"] > 0]["durationUTC"]],
    title="Days actively participating in subreddit (more than one post/comment)"
).show()

print("Mean duration: {}".format(timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].mean())))
print("Median duration: {}".format(timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].median())))
print("Standard deviation of duration: {}\n".format(
    timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].std())))

secsInDay = 60*60*24;

print("Users only posting on a single day: {}".format(
    df_startend["durationUTC"][df_startend["durationUTC"] < secsInDay].count()))
print("Percentage of users with less than a single day of activity: {0:.2f}%".format( 
      df_startend["durationUTC"][df_startend["durationUTC"] < secsInDay].count() * 100 / 
      df_startend["durationUTC"].count()))
Mean duration: 384 days, 20:26:39.087454
Median duration: 156 days, 0:45:31
Standard deviation of duration: 534 days, 13:55:53.371681

Users only posting on a single day: 9325
Percentage of users with less than a single day of activity: 48.79%

The distribution is again heavily skewed. Further analysis reveals that almost half of users never post for more than one day.

Let's try one more time, filtering out all users who only have activity on one day.

In [22]:
px.histogram(
    x=[timedelta(seconds=s).days for s in df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"]],
    title="Days actively participating in subreddit (more than one day of posting/commenting)"
).show()

print("Mean duration: {}".format(
    timedelta(seconds=df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"].mean())))
print("Median duration: {}".format(
    timedelta(seconds=df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"].median())))
print("Standard deviation of duration: {}\n".format(
    timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].std())))
Mean duration: 473 days, 21:49:15.601063
Median duration: 277 days, 8:32:13
Standard deviation of duration: 534 days, 13:55:53.371681

Out of curiousity, let's plot the activity of users at arbitrary breakpoints: 1 post, 1 day, 1 month, 1 semester, 1 year, 2 years, 4 years.

In [23]:
pieSlices = [
len(df_startend[df_startend["durationUTC"] == 0]),
len(df_startend[(df_startend["durationUTC"] > 0) & (df_startend["durationUTC"] < secsInDay)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay) & (df_startend["durationUTC"] < secsInDay * 30)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 30) & (df_startend["durationUTC"] < secsInDay * 180)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 180) & (df_startend["durationUTC"] < secsInDay * 365)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 365) & (df_startend["durationUTC"] < secsInDay * 365 * 2)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 365 * 2) & (df_startend["durationUTC"] < secsInDay * 365 * 4)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 365 * 4)]),
]
labels = \
    ["1 post", "<1 day", "1 day - 1 month", "1 month - 6 months", "6 months - year", "1-2 years", "2-4 years", ">4 years"]

total = 0
text = []
for sl in pieSlices:
    total += sl
    text.append(total)
    
text = ["{0:.2f}%".format(t / total * 100) for t in text]

go.Figure(data=[go.Pie(labels=labels, values=pieSlices, text=text, sort=False)]) \
  .update_traces(hoverinfo='label+percent', textinfo='text', textfont_size=14) \
  .show()

The chart shows the cumulative percentage of users who have posted less than the given timeframe on the slice. Hovering over any slice will give you the size of the slice, as a percentage.

Given this chart, we can see that 88.2% of users are active less than 2 years. This dispels the notion that there is a common archetype of 4-year users (users who join their freshman year and leave their senior year). 69.6% of users are constrained within the time frame of a single semester.

Categorizing Post Content

Next, we will be categorizing all of r/UMD's posts into different groups. No posts are labeled already, and in fact the categories are not defined yet. We will be using Scikit-Learn's KMeans algorithm, which requires us to properly prepare our text dataset and create a TF-IDF matrix. This will be an example of unsupervised machine learning, as we do not have a labeled dataset to test the KMeans model.

First, we need to define a function to clean up our text. This function breaks a line into tokens, reduces tokens to their stems, and removes stopwords, which are words that don't add anything to the post's meaning (i.e. articles). This effectively sanitizes our text and makes it about as uniform as possible.

In [24]:
#create stemmer and stopwords
ps = PorterStemmer()
words = stopwords.words('english')

#breaks words into stems, forces them into lowercase, tokenizes based on whitespace, and removes stopwords
def clean(x) :
    return ' '.join([ps.stem(i) for i in re.sub('[^a-zA-Z]', ' ', x).split() if i not in words]).lower()

Now that we have defined our tokenization function, we can use it to clean up our dataset. Let's go ahead and apply it to both the title and the body of the post, and save the "clean" versions as new columns. While we're at it, we can add a new column which contains the title appended to the body, since for most of our analysis, we will treat the combination of title and body as the text of each post. This weights the title and the post body equally in determining the content.

In [28]:
#clean the title and text by applying the clean function as stated above
title_clean = df_post['title'].apply(clean)
text_clean = df_post['selftext'].apply(clean)

#concatenates the cleaned title and text to create a column with all words per post
df_post['doc'] = title_clean.map(str) + text_clean

In order to create a TF-IDF matrix, we will use Scikit-Learn's TfidfVectorizer package. Since we have already cleaned our data, all we have to do is create a new TfidfVectorizer, convert the post texts to a list and fit the vectorizer, and construct a new dataframe from the result.

In [29]:
titles = df_post['title'].tolist()
corpus = df_post['doc'].tolist()

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
df_post_tfidf = pd.DataFrame(X.T.todense(), index=vectorizer.get_feature_names(), columns = titles)

Now that we have a TF-IDF matrix, we can use Scikit-Learn's KMeans function to split the data up into clusters based on similarities in text from the TF-IDF matrix.

Unfortunately, KMeans is a non-deterministic algorithm, meaning that it will give a different result if run multiple times. This can lead to some interesting results, as the clusters change each time the algorithm is run. This can be alleviated by providing an integer for a random seed, which will cause the KMeans to give the same result every time. Though the selection is trivial, we have chosen 971 as our random seed.

Here we are specifying k = 15 to categorize the data into 15 clusters. We need to manually inspect each cluster to see what the posts in each cluster have in common, and we can give names to our clusters.

In [30]:
# using KMeans, cluster the data into a set number of categories
true_k = 15
r = 971 # KMeans is non-deterministic unless we specify the random seed
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, random_state = r)

# fit the model
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i),
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    
# free up extra space
del df_post_tfidf
Top terms per cluster:
Cluster 0:
 major
 cs
 scienc
 comput
 engin
 doubl
 minor
 program
 school
 career
Cluster 1:
 transfer
 umd
 student
 school
 credit
 gpa
 appli
 fall
 colleg
 semest
Cluster 2:
 math
 class
 calc
 exam
 major
 stat
 semest
 cours
 placement
 anyon
Cluster 3:
 ticket
 game
 student
 michigan
 basketbal
 anyon
 sell
 extra
 guest
 state
Cluster 4:
 maryland
 like
 terp
 student
 look
 time
 need
 help
 peopl
 final
Cluster 5:
 park
 colleg
 lot
 permit
 free
 car
 campu
 dot
 overnight
 summer
Cluster 6:
 room
 hous
 look
 apart
 common
 live
 roommat
 leas
 rent
 sublet
Cluster 7:
 cmsc
 class
 semest
 anyon
 summer
 math
 exam
 taken
 cs
 cours
Cluster 8:
 campu
 place
 south
 live
 job
 know
 hous
 look
 best
 anyon
Cluster 9:
 umd
 student
 edu
 http
 school
 connect
 like
 know
 email
 use
Cluster 10:
 talk
 week
 happen
 promot
 jpg
 sport
 upcom
 com
 http
 event
Cluster 11:
 cours
 credit
 class
 semest
 onlin
 taken
 anyon
 summer
 level
 grade
Cluster 12:
 easi
 class
 credit
 gen
 ene
 cours
 need
 ed
 onlin
 level
Cluster 13:
 class
 semest
 waitlist
 credit
 drop
 taken
 like
 anyon
 professor
 grade
Cluster 14:
 anyon
 know
 doe
 want
 els
 like
 thank
 ha
 look
 taken

After looking at the clusters, we have decided on some appropirate titles for each group:

In [31]:
subjects = {0 : 'major requirements', 1 : 'admissions / trasfer', 2 : 'math', 3 : 'sports', 4 : 'general umd',
            5 : 'parking', 6 : 'housing', 7 : 'cs classes', 8 : 'housing', 9 : 'events / internet', 10 : 'weekly posts',
            11 : 'registration', 12 : 'course / campus questions'}

Now our KMeans model is trained! We can test it out by taking a random sample of the data and predicting each post's category by using the model's predict() function. We have printed this output below so it can be seen how accurate the groupings are.

In [32]:
# Unwraps the prediction from the model and looks up the category string in the dictionary, as well as grouping
# classifications with similar characteristics.
def classify(post) :
    Y = vectorizer.transform([post])
    prediction = model.predict(Y)[0]
    if prediction == 12 :
        prediction = 11
    if prediction > 12 :
        prediction = 12
    return prediction
In [33]:
#create random sample of dataframe
sample = df_post.sample(n=40)
header_str = '~~~~~~~~~~'
#sample = df
pred = []
#add column for the prediction to the dataframe
for row in sample.iterrows() : 
    pred.append(classify(row[1]['doc']))
sample['pred'] = pred
#display sample posts by subject
for i in range(0,13) :
        print()
        print(header_str,subjects[i],header_str)
        sub = sample[sample['pred'] == i]
        for row in sub.iterrows() :
            print(row[1]['title'])
~~~~~~~~~~ major requirements ~~~~~~~~~~
Possible to take CMSC250 at community college for CS?
Chemical and Materials Science Engineering
Smiths Business school question

~~~~~~~~~~ admissions / trasfer ~~~~~~~~~~

~~~~~~~~~~ math ~~~~~~~~~~
Looking for Calc 1(MATH140) tutor

~~~~~~~~~~ sports ~~~~~~~~~~
Selling penn state ticket message me

~~~~~~~~~~ general umd ~~~~~~~~~~
Lost key on north campus
FSU rumored to accept move to Big XII, rumors being denied that GT to the B1G. This would validate our move to the B1G and the fact that the ACC shot itself in the foot by forsaking football for basketball. 
y’all we got a situation
Smaller Dining Plans?
How did Bowie State get the First Lady and we got a retired baseball player?
William fucking Likely
Shuttle Routes 141 Gaithersburg & 142 Columbia
Join Philosophy Club!
Can someone ELI5 what an unsub direct loan is?
Food Options?
Final grades
Community Event in Response to the Anti-Transgender Memo
Selling Michigan Ticket
Harrison twins trims the list down to 3 schools.. Lets cross our fingers
anyone have a coursehero account?
Please fix eduroam.
Stefon Diggs
Thinking about taking a semester off
Could it be another ENGL 393 survey? Survey says...yes
Commuter thinking about living off campus
Network Outage this Friday
Anybody wanna binge watch some Rick and Morty to help for these finals?
Lost iPhone, Purple Bus
Dining plan

~~~~~~~~~~ parking ~~~~~~~~~~
Parking Permit

~~~~~~~~~~ housing ~~~~~~~~~~
Can you swap rooms multiple times on Room Exchange?
Courtyards Room Available
Grad Students/PhD Students - where do you live?

~~~~~~~~~~ cs classes ~~~~~~~~~~

~~~~~~~~~~ housing ~~~~~~~~~~
How competitive are on campus part time jobs? Specifically in the IT Department?

~~~~~~~~~~ events / internet ~~~~~~~~~~
UMD Mens Ultimate Frisbee Club
UMD Alerts

~~~~~~~~~~ weekly posts ~~~~~~~~~~

~~~~~~~~~~ registration ~~~~~~~~~~

~~~~~~~~~~ course / campus questions ~~~~~~~~~~
Does anyone know anything about BSCI339Q: Diseases Due to Dysfunctional Organelles, with Ades?
anyone else transferring to UMD here that knows what to do next.
Need info on the following professors for physics
Lost iTouch at McKeldin 10/09 around 9:53-10:00 PM

Since we can see that the categorization is accurate, we can go ahead and append a new column to our posts dataframe with their classification, and we are done with categorizing the posts of r/UMD!

In [34]:
classifications = []

# add classification for every row
for row in df_post.iterrows() :
    classifications.append(subjects[classify(row[1]['doc'])])
df_post['class'] = classifications
df_post.drop(['doc'], axis = 1) # don't need this anymore
df_post.head(5)
Out[34]:
id name url title selftext score created_utc permalink link_flair_text doc class
0 dxv1c4 Baking-and-books https://www.reddit.com/r/UMD/comments/dxv1c4/r... Re-Leasing Apartment! Re-Leasing my room in Commons 6. Amazing room... 1 1.574036e+09 /r/UMD/comments/dxv1c4/releasing_apartment/ Housing re leas apartre leas room common amaz roommat ... housing
1 dxuxzp cdrgnvrk https://www.reddit.com/r/UMD/comments/dxuxzp/e... Eduroam ACTUALLY sucks dick Fuck the division of IT for allowing this bull... 1 1.574035e+09 /r/UMD/comments/dxuxzp/eduroam_actually_sucks_... None eduroam actual suck dickfuck divis it allow bu... general umd
2 dxuwpy TonyChen616 https://v.redd.it/z0rvpzqi4cz31 OG Legends Strikes Again 4 1.574035e+09 /r/UMD/comments/dxuwpy/og_legends_strikes_again/ Discussion og legend strike again general umd
3 dxu8we Shalleycat https://www.reddit.com/r/UMD/comments/dxu8we/s... Sustainable turtle sticker Anyone know where I can get one of those susta... 1 1.574032e+09 /r/UMD/comments/dxu8we/sustainable_turtle_stic... None sustain turtl stickeranyon know i get one sust... general umd
4 dxttl0 Rooser1212 https://www.reddit.com/r/UMD/comments/dxttl0/s... Spring/Summer 2020 Sublease I am studying abroad next semester and am look... 1 1.574030e+09 /r/UMD/comments/dxttl0/springsummer_2020_suble... Housing spring summer subleasi studi abroad next semes... housing

We can also use Plotly to create a pie chart, which can be a nice visual aid to show the breakdown of the post categories across the entire subreddit.

To do this, we must iterate through the table and tally up each category.

In [35]:
subject_count = {'major requirements' : 0, 'admissions / trasfer' : 0, 'math' :0, 'sports' : 0, 'general umd' : 0,
            'parking' : 0, 'housing' : 0, 'cs classes' : 0, 'housing' : 0, 'events / internet' : 0, 'weekly posts' : 0,
            'registration' : 0, 'course / campus questions' : 0}

# tally up each category
for row in df_post.iterrows() :
    subject_count[row[1]['class']] += 1
In [36]:
# Create temporary dataframe for use of Plotly
df_temp = pd.DataFrame()
df_temp['classification'] = subject_count.keys()
df_temp['count'] = subject_count.values()
fig = px.pie(df_temp, values = 'count', names='classification', title='Classification of r/UMD Posts by Percent')
fig.show()

Looking at the breakdown of posts, there are obviously a few topics much more common than others. Of course, posts about UMD take the lead, as this is a UMD-themed subreddit. There is a massive distribution of posts regarding advice about courses, as well as registration, housing, and campus events.

Collective Sentiment Analysis of Posts

Next, we will look at all the posts of r/UMD and analyze their sentiments, classifying them as positive, negative, or neutral.

To do this, we will be using the VADER (Valence Aware Dictionary and sEntiment Reasoner) algorithm, which is a pre-trained model that specializes in sentiment analysis of social media posts.

VADER takes in a string and returns 4 scores: positive, neutral, negative, and compound. The first 3 reflect the percent of the string made up of positive, negative, and neutral keywords. These scores always add up to 1. Compound score is a composite of the first 3, between -1 and 1, which is normalized to account for context, length, and emphasis of the words. We define a score of >= 0.05 as positive, < 0.05 && > -0.05 as neutral, and <= 0.05 as negative, in accordance with the VADER guidelines.

Firstly, we define a simple function to return 'positive', 'negative', or 'neutral' based on the composite score.

In [37]:
analyzer = SentimentIntensityAnalyzer()

def classify_sentiment(sentence) :
    # four-part score from VADER
    score = analyzer.polarity_scores(sentence)
    if score['pos'] >= 0.05 :
        return 'positive'
    if score['neg'] <= 0.05 :
        return 'negative'
    return 'neutral'

We need to iterate through all posts and run this function to get a sentiment for each post. We can also add a new column to the table corresponding to the sentiment of each post.

In [38]:
sentiments = []

for row in df_post.iterrows() :
    p = classify_sentiment(row[1]['title'] + ' ' + row[1]['selftext'])
    sentiments.append(p)
    
df_post['sentiment'] = sentiments
df_post.head(5)
Out[38]:
id name url title selftext score created_utc permalink link_flair_text doc class sentiment
0 dxv1c4 Baking-and-books https://www.reddit.com/r/UMD/comments/dxv1c4/r... Re-Leasing Apartment! Re-Leasing my room in Commons 6. Amazing room... 1 1.574036e+09 /r/UMD/comments/dxv1c4/releasing_apartment/ Housing re leas apartre leas room common amaz roommat ... housing positive
1 dxuxzp cdrgnvrk https://www.reddit.com/r/UMD/comments/dxuxzp/e... Eduroam ACTUALLY sucks dick Fuck the division of IT for allowing this bull... 1 1.574035e+09 /r/UMD/comments/dxuxzp/eduroam_actually_sucks_... None eduroam actual suck dickfuck divis it allow bu... general umd neutral
2 dxuwpy TonyChen616 https://v.redd.it/z0rvpzqi4cz31 OG Legends Strikes Again 4 1.574035e+09 /r/UMD/comments/dxuwpy/og_legends_strikes_again/ Discussion og legend strike again general umd neutral
3 dxu8we Shalleycat https://www.reddit.com/r/UMD/comments/dxu8we/s... Sustainable turtle sticker Anyone know where I can get one of those susta... 1 1.574032e+09 /r/UMD/comments/dxu8we/sustainable_turtle_stic... None sustain turtl stickeranyon know i get one sust... general umd positive
4 dxttl0 Rooser1212 https://www.reddit.com/r/UMD/comments/dxttl0/s... Spring/Summer 2020 Sublease I am studying abroad next semester and am look... 1 1.574030e+09 /r/UMD/comments/dxttl0/springsummer_2020_suble... Housing spring summer subleasi studi abroad next semes... housing positive

Now, just as before, we can create a pie chart to illustrate the sentiment distribution of r/UMD using Plotly.

In [39]:
sent_count = {'positive' : 0, 'negative' : 0, 'neutral' : 0}
# iterate through the table and get sentiments
for row in df_post.iterrows() :
    sent_count[row[1]['sentiment']] += 1

# plot the pie chart
df_temp = pd.DataFrame()
df_temp['sentiment'] = sent_count.keys()
df_temp['count'] = sent_count.values()
fig = px.pie(df_temp, values = 'count', names='sentiment', title='Sentiment of r/UMD Posts by Percent')
fig.show()

This pie chart shows us that most of the posts in r/UMD tend to have a positive sentiment.

But wait, there's more! Since it was so straightforward to perform a sentiment analysis on the posts, let's repeat the same process for the comments.

First, let's append a sentiment column to the comments dataframe.

In [40]:
sentiments = []
for row in df_comment.iterrows() :
    p = classify_sentiment(row[1]['body'])
    sentiments.append(p)
    
df_comment['sentiment'] = sentiments
df_comment.head(5)
Out[40]:
id name body score parent_id link_id created_utc sentiment
0 f7wxk0a DeltaHex106 lol nooiiiccee. 4 t3_dxuwpy t3_dxuwpy 1.574041e+09 positive
1 f7x0i9l The_Joker_07 It says occurred on October 17, what???? 1 t3_dxuwpy t3_dxuwpy 1.574043e+09 negative
2 f7x0mwv YaBoiAtUMD 💀💀💀 1 t3_dxuwpy t3_dxuwpy 1.574043e+09 negative
3 f7w5tlk Thedaniel4999 I never had Dixon but can confirm that Stocker... 1 t3_dxtdkl t3_dxtdkl 1.574030e+09 positive
4 f7w95pd lordkaramat Yea, you need to show a student ID when you ge... 4 t3_dxt72p t3_dxt72p 1.574031e+09 negative

And now, as before, we will create our pie chart showing the sentiment of the comments.

In [41]:
sent_count = {'positive' : 0, 'negative' : 0, 'neutral' : 0}
for row in df_comment.iterrows() :
    sent_count[row[1]['sentiment']] += 1

df_temp = pd.DataFrame()
df_temp['sentiment'] = sent_count.keys()
df_temp['count'] = sent_count.values()
fig = px.pie(df_temp, values = 'count', names='sentiment', title = 'Sentiment of r/UMD Comments by Percent')
fig.show()

As can be seen in the above pie chart, the majority of the comments are positive, which may indicate that the r/UMD community has mostly supportive commentary.

Creating a dataframe containing every word ever written on r/UMD

In the following cell, we will tokenize every single word ever written in a post title, post description, or comment on r/UMD, and put them all in a single dataframe. This dataframe will have a column for the word, a column for the source of the word (post title, post description, or comment), a column for the username of the person who wrote the word, and a column for the date and time at which the word was posted.

To make the process more efficient, we will be collecting the words in lists of dictionaries that we will then add to a dataframe (this is faster than individually adding each word to the dataframe). We will first do this with the posts (both titles and descriptions), and then we will repeat the process with the comments. In order to successfully run this code, the "del" command was used to manually induce garbage collection to free up more memory, and the total memory allocated to the Docker container was doubled in the Docker settings. Because of the immense task this code performs, at the end of the block, we save it as a .CSV so that if the kernel restarts, we don't have to rerun this entire block of code.

In [42]:
# Create tokenizer based on a regular expression that filters out punctuation
# Includes apostrophes for contractions, hyphenated words, and periods for decimals
tokenizer = RegexpTokenizer('\d\.\d+|\w+[\'-]?\w*|\$?\d+\.\d+')

# Make list for the title and list for the description
# Each list will be a list of dictionaries that will then be converted to a dataframe
title_words_list = []
desc_words_list = []
for index, row in df_post.iterrows():
    # tokenize the title
    title_tokens = tokenizer.tokenize(row['title'])
    # get parts of speech
    title_tokens = nltk.pos_tag(title_tokens)
    # tokenize the description
    desc_tokens = tokenizer.tokenize(row['selftext'])
    # get parts of speech
    desc_tokens = nltk.pos_tag(desc_tokens)
    
    for title_tok in title_tokens:
        # key = col_name
        title_dict = {}
        # Convert each word to lower-case so that varied capitalization doesn't interfere with our word counts later
        title_dict['word'] = title_tok[0].lower()
        title_dict['pos'] = title_tok[1]
        title_dict['source'] = 'title'
        title_dict['user'] = row['name']
        title_dict['sentiment'] = row['sentiment']
        title_dict['date'] = row['created_utc']
        # add created date again, but this time just the date rather than the date and time (we'll use this later)
        created = time.localtime(row['created_utc'])
        title_dict['date_ymd'] = dt.datetime(created.tm_year, created.tm_mon, created.tm_mday).timestamp()
        title_words_list.append(title_dict)

    for desc_tok in desc_tokens:
        # key = col_name
        desc_dict = {}
        # Convert each word to lower-case so that varied capitalization doesn't interfere with our word counts later
        desc_dict['word'] = desc_tok[0].lower()
        desc_dict['pos'] = desc_tok[1]
        desc_dict['source'] = 'description'
        desc_dict['user'] = row['name']
        desc_dict['sentiment'] = row['sentiment']
        desc_dict['date'] = row['created_utc']
        # add created date again, but this time just the date rather than the date and time (we'll use this later)
        created = time.localtime(row['created_utc'])
        desc_dict['date_ymd'] = dt.datetime(created.tm_year, created.tm_mon, created.tm_mday).timestamp()
        desc_words_list.append(desc_dict)

# Add the words from the titles and the descriptions to a dataframe
words_frame = pd.DataFrame(title_words_list)
words_frame = words_frame.append(pd.DataFrame(desc_words_list))

# Clear up memory
del desc_words_list
del title_words_list

print("All words from posts and descriptions successfully added to dataframe.")

# Function to get the words from all the comments.
# This function will be called several separate times to deal with the memory issues,
# allowing us to clear up memory between each call.
def get_words_from_comments(comm_words_list, start, end):
    # Make list for the comments
    count = start
    if(end > df_comment.shape[0]):
        end = df_comment.shape[0]
    for index, row in df_comment[start:end].iterrows():
        # tokenize the comment
        comm_tokens = tokenizer.tokenize(row['body'])
        comm_tokens = nltk.pos_tag(comm_tokens)

        for comm_tok in comm_tokens:
            # key = col_name
            comm_dict = {}
            # Convert each word to lower-case so that varied capitalization doesn't interfere with our word counts later
            comm_dict['word'] = comm_tok[0].lower() 
            comm_dict['pos'] = comm_tok[1]
            comm_dict['source'] = 'comment'
            comm_dict['user'] = row['name']
            comm_dict['sentiment'] = row['sentiment']
            comm_dict['date'] = row['created_utc']
            # add created date again, but this time just the date rather than the date and time (we'll use this later)
            created = time.localtime(row['created_utc'])
            desc_dict['date_ymd'] = dt.datetime(created.tm_year, created.tm_mon, created.tm_mday).timestamp()
            comm_words_list.append(comm_dict)

        # keep track of count to show progress for sanity -- this code 
        count += 1
        if(count % 10000 == 0):
            print("Processed comments:", count)
    return comm_words_list


i = 0
while i < df_comment.shape[0]:
    # We will process 30,000 comments at a time
    i = i + 30000
    comm_wordlist = get_words_from_comments([],i - 30000,i)
    # Add the words from the comments to the dataframe
    words_frame = words_frame.append(pd.DataFrame(comm_wordlist), sort=True)
    # Clear up memory
    del comm_wordlist

print("Total number of words in r/UMD:", len(words_frame))
words_frame.head()

# Save the dataframe as a .CSV so that we don't have to rerun all this code if the kernel restarts
words_frame.to_csv(path_or_buf='all_words_in_r_umd.csv', index=False)
All words from posts and descriptions successfully added to dataframe.
Processed comments: 10000
Processed comments: 20000
Processed comments: 30000
Processed comments: 40000
Processed comments: 50000
Processed comments: 60000
Processed comments: 70000
Processed comments: 80000
Processed comments: 90000
Processed comments: 100000
Processed comments: 110000
Processed comments: 120000
Processed comments: 130000
Processed comments: 140000
Processed comments: 150000
Processed comments: 160000
Processed comments: 170000
Processed comments: 180000
Processed comments: 190000
Processed comments: 200000
Processed comments: 210000
Processed comments: 220000
Processed comments: 230000
Total number of words in r/UMD: 9348073

We will avoid running the above cell unless the database containing all the r/UMD data has been updated. Otherwise, we will simply read from the CSV file that the above cell generates, as is done in the following cell.

In [43]:
# The "del words_frame" is included to allow us to read from the .CVS completely fresh.
# We've put it in a try-except block so that the cell can be fully evaluated regardless of whether words_frame is defined.
try:
    del words_frame
except:
    pass

words_frame = pd.read_csv('./all_words_in_r_umd.csv')

print("Total number of words in r/UMD:", len(words_frame))
words_frame.head()
Total number of words in r/UMD: 9348073
Out[43]:
date date_ymd pos sentiment source user word
0 1.574036e+09 1.574035e+09 JJ positive title Baking-and-books re-leasing
1 1.574036e+09 1.574035e+09 NN positive title Baking-and-books apartment
2 1.574035e+09 1.573949e+09 NNP neutral title cdrgnvrk eduroam
3 1.574035e+09 1.573949e+09 NNP neutral title cdrgnvrk actually
4 1.574035e+09 1.573949e+09 VBZ neutral title cdrgnvrk sucks

Now that we have dataframes full of every post, comment, and word ever posted on r/UMD, let's find out some fun stuff.

In the following cells, we will count the number of words that each user has written, the number of posts each user has posted, the number of comments each user has commented, and the total karma each user has accumulated within r/UMD, put them all into dataframes, and write it all to three .CSV files. We will display the top 11 users for each category in tables, and visualize the top 50 users in bar charts using Plotly.

In [44]:
# Get the most verbose users:
most_verbose = pd.DataFrame(words_frame['user'].value_counts()).reset_index()
most_verbose = most_verbose.rename(columns={'index':'user', 'user':'num_words'})
most_verbose.to_csv('most_verbose_users.csv', index=False)
print('Most Verbose Users:')
display(most_verbose.head(11))

# Truncate the dataframe to exclude "None" for the plot
most_verbose_trunc = most_verbose.truncate(before=1).head(50)
most_verbose_fig = go.Figure(
    data=[go.Bar(x=most_verbose_trunc['user'], y=most_verbose_trunc['num_words'])],
    layout_title_text="Most Verbose Users"
)
most_verbose_fig.update_yaxes(title_text = 'Number of Words')
most_verbose_fig.update_xaxes(title_text = 'User')#, range=[0.5,50])
most_verbose_fig.show()
Most Verbose Users:
user num_words
0 None 300349
1 Miseryy 93431
2 uldu 86152
3 worldchrisis 59779
4 umdit 44233
5 MovkeyB 38728
6 umdnocguy 37550
7 umd_charlzz 36880
8 ThisGoldAintFree 36520
9 TheLeesiusManifesto 34498
10 Blue_5ive 33854
In [45]:
# Get the most-posting users
most_posts = pd.DataFrame(df_post['name'].value_counts()).reset_index()
most_posts = most_posts.rename(columns={'index':'user', 'name':'num_posts'})
most_posts.to_csv('most_posting_users.csv', index=False)
print('Most Posting Users:')
display(most_posts.head(11))

# Truncate the dataframe to exclude "None" from the plot
most_posts_trunc = most_posts.truncate(before=1).head(50)
most_posts_fig = go.Figure(
    data=[go.Bar(x=most_posts_trunc['user'], y=most_posts_trunc['num_posts'])],
    layout_title_text="Most Posting Users"
)
most_posts_fig.update_xaxes(title_text='User')
most_posts_fig.update_yaxes(title_text='Number of Posts')
most_posts_fig.show()
Most Posting Users:
user num_posts
0 None 12121
1 AutoModerator 225
2 umdit 154
3 RK_229542 140
4 lightintheblinds 124
5 Goozombies 116
6 Vu004 87
7 t3rps 82
8 Balderdasheries 80
9 Toasted_FlapJacks 70
10 transfr-umd19 70
In [46]:
# Get the most-commenting users
most_comments = pd.DataFrame(df_comment['name'].value_counts()).reset_index()
most_comments = most_comments.rename(columns={'index':'user', 'name':'num_comments'})
most_comments.to_csv('most_commenting_users.csv', index=False)
print('Most Commenting Users:')
display(most_comments.head(11))

# Truncate the dataframe to exclude "None" from the plot
most_comments_trunc = most_comments.truncate(before=1).head(50)
most_comments_fig = go.Figure(
    data=[go.Bar(x=most_comments_trunc['user'], y=most_comments_trunc['num_comments'])],
    layout_title_text="Most Commenting Users"
)
most_comments_fig.update_xaxes(title_text='User')
most_comments_fig.update_yaxes(title_text='Number of Comments')
most_comments_fig.show()
Most Commenting Users:
user num_comments
0 None 8167
1 worldchrisis 1770
2 uldu 1460
3 Miseryy 1246
4 Blue_5ive 1135
5 MovkeyB 1072
6 pahoodie 961
7 ThisGoldAintFree 840
8 turtle_stank 795
9 OddaJosh 747
10 CStruggle 730
In [47]:
# Get the users with the most total karma from all their posts and comments on r/UMD
# Concatenate the df_post and df_comment dataframes, then group by the name, and then sum the karma
grouped_karma = pd.concat([df_post, df_comment], sort=False).groupby('name').sum()
# Sort by the score
sorted_karma = grouped_karma.sort_values('score', ascending=False).reset_index()
# Clean up our table
sorted_karma = sorted_karma.rename(columns={'name':'user', 'score':'karma'})
sorted_karma = sorted_karma.drop(columns=['created_utc'], axis=1)

display(sorted_karma.head(11))

# Truncate to remove None and get the first 50 to plot
most_karma_trunc = sorted_karma.truncate(before=1).head(50)
most_karma_fig = go.Figure(
    data=[go.Bar(x=most_karma_trunc['user'], y=most_karma_trunc['karma'])],
    layout_title_text="Users with the Most Karma Accrued in r/UMD"
)
most_karma_fig.update_xaxes(title_text='User')
most_karma_fig.update_yaxes(title_text='Amount of Karma')
most_karma_fig.show()
user karma
0 None 98812
1 turtle_stank 8664
2 MovkeyB 7787
3 PoshLagoon 7662
4 MischaTheJudoMan 7340
5 Miseryy 6912
6 worldchrisis 6908
7 Goozombies 6708
8 thelorax18 5971
9 ericmm76 5938
10 Cap_g 5671

As is evident from the tables, the most verbose user and the user with the most posts, comments, and karma is "None," which is actually not a single user, but the collection of users that have since had their accounts deactivated, thus causing them to be listed as "None" in the dataframes. Thus, for the plots, we have excluded "None" by simply truncating the dataframes to exclude their first indices.

Notably, u/AutoModerator is ranked highest for number of posts, as it is a bot that makes weekly "This Week At UMD" posts. Ranked fourth in verboseness, second in posts, and eleventh in comments is u/umdit, the official Reddit account for UMD's IT department, which regularly posts about and responds to posts relating to IT issues. u/UMD_DOTS is also in the top 50 most verbose users for a similar reason due to posts and comments about transportation issues.

It is also easy to see that many of the users appear in each of the above plots (for instance, u/turtle_stank, u/MovkeyB, u/Miseryy, u/worldchrisis, and u/uldu to name a few). This makes sense, because in order to get a high verboseness score, a user would need to make a significant amount of comments and posts, which would usually cause the user to accumulate a large amount of karma.

Users with the Least Amount of Karma on r/UMD

In [52]:
# Get the users with the least total karma from all their posts and comments on r/UMD
# Sort by the score, least to most
sorted_karma = grouped_karma.sort_values('score', ascending=True).reset_index()
# Clean up our table
sorted_karma = sorted_karma.rename(columns={'name':'user', 'score':'karma'})
sorted_karma = sorted_karma.drop(columns=['created_utc'], axis=1)

display(sorted_karma.head(11))

# Truncate to remove None and get the first 50 to plot
least_karma_trunc = sorted_karma.head(50)
least_karma_fig = go.Figure(
    data=[go.Bar(x=least_karma_trunc['user'], y=least_karma_trunc['karma'])],
    layout_title_text="Users with the Lowest Karma Accrued in r/UMD"
)
least_karma_fig.update_xaxes(title_text='User')
least_karma_fig.update_yaxes(title_text='Amount of Karma')
least_karma_fig.show()
user karma
0 cms2337 -243
1 VectorMarketingRep -239
2 beast_mode_fortnite -216
3 nopetrol -214
4 idsardi -167
5 zen_veteran -152
6 younghostility -147
7 GoCuse -138
8 Souflay_Boi -137
9 vawksal -134
10 Meditos -125

Many of these users are likely trolls, however, u/VectorMarketingRep is a representative of the multi-level marketing scam known as Vector Marketing. Judging by the immensely negative karma that u/VectorMarketingRep has, it is safe to say that most users of r/UMD are aware of the pyramid scheme that has been run by Vector Marketing.

We will now define functions that will give a user's rank within each category (verboseness, number of posts, and number of comments).

In [53]:
# Verboseness ranking: returns [rank, number of words]
def rank_verbose(user):
    for index, row in most_verbose.iterrows():
        # case-insensitive string comparison because nobody remembers capitalization
        if (row['user'].lower() == user.lower()):
            return [index, row['num_words']]
    # If user doesn't exist in r/UMD:
    return [-1,-1]

# Posting ranking: returns [rank, number of posts]
def rank_posts(user):
    for index, row in most_posts.iterrows():
        # case-insensitive string comparison because nobody remembers capitalization
        if (row['user'].lower() == user.lower()):
            return [index, row['num_posts']]
    # If user doesn't exist in r/UMD:
    return [-1,-1]

# Posting ranking: returns [rank, number of comments]
def rank_comments(user):
    for index, row in most_comments.iterrows():
        # case-insensitive string comparison because nobody remembers capitalization
        if (row['user'].lower() == user.lower()):
            return [index, row['num_comments']]
    # If user doesn't exist in r/UMD:
    return [-1,-1]

Now, we will define a function that, for any user, returns a dictionary containing:

'user' : The users's username

'num_posts' : Total number of posts the user has made in r/UMD

'posts_rank' : Their ranking in terms of their number of posts made in r/UMD as compared to all other users of r/UMD

'num_comments' : Total number of comments the user has made in r/UMD

'comments_rank' : Their ranking in terms of their number of comments made in r/UMD as compared to all other users of r/UMD

'num_words' : Total number of words written in r/UMD by the user

'words_rank' : Their ranking in terms of the number of words they have written in r/UMD as compared to all other users of r/UMD

'first_post_date_utc' : The time of the first post the user made in r/UMD in seconds since 1970

'first_post_date' : The date and time of the first post the user made in r/UMD, presented as a string

'first_post_title' : The title of the user's first post in r/UMD

'first_post_url' : The URL linking to the user's first post in r/UMD

'umd_post_karma' : The total amount of karma the user has accumulated in r/UMD alone from posts

'pop_post_karma' : The greatest amount of karma the user has ever received on a single post to r/UMD

'pop_post_title' : The title of the user's post that received the most karma out of all their posts to r/UMD

'pop_post_url' : The URL linking to the user's post that received the most karma out of all their posts to r/UMD

'worst_post_karma' : The least amount of karma the user has ever received on a single post to r/UMD

'worst_post_title' : The title of the user's post that received the least karma out of all their posts to r/UMD

'worst_post_url' : The URL linking to the user's post that reveived the least karma out of all their posts to r/UMD

'first_comment_date_utc' : The creation time of the first comment that the user ever made on a post in r/UMD in seconds since 1970

'first_comment_date' : The date and time of the first comment the user ever made to a post in r/UMD, presented as a string

'first_comment_body' : The text contained within the first comment the user ever made to a post in r/UMD

'umd_comment_karma' : The total amount of karma the user has received from comments in r/UMD alone

'pop_comment_karma' : The most amount of karma the user has ever received from a comment in r/UMD

'pop_comment_body' : The text contained within the user's comment that received the most karma out of all their comments in r/UMD

'worst_comment_karma' : The least amount of karma the user has ever received from a comment in r/UMD

'worst_comment_body' : The text contained within the user's comment that received the least karma out of all their comments in r/UMD

'total_umd_karma' : The total amount of karma the user has received from all their posts and comments in r/UMD

'favorite_word' : The word that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD

'favorite_adj' : The adjective that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD

'favorite_verb' : The verb that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD

'favorite_noun' : The noun that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD

'sentiment' : Gives a dictionary containing the percentage of the users posts and comments classified as having each sentiment. This dictionary's keys are 'positive', 'neutral', and 'negative'.

In [54]:
pos_noun = ['NN', 'NNS', 'NNP', 'NNPS']
pos_adj = ['JJ', 'JJR', 'JJS']
pos_verb = ['VB', 'VBD', 'VBG', 'VBP', 'VBZ']

# Function that returns dictionary summarizing the r/UMD activity of a particular user
def analyze_user(user):
    verbose = rank_verbose(user)
    if(verbose[0] != -1):
        posts = rank_posts(user)
        comments = rank_comments(user)
        
        # Get more data on user:
        
        # Get some info on their posts
        first_post_date = None
        first_post_title = "NA"
        first_post_url = "NA"
        first_post_karma = 0
        post_karma = 0
        pop_post_karma = 0
        pop_post_title = "NA"
        pop_post_url = "NA"
        hated_post_karma = 0
        hated_post_title = "NA"
        hated_post_url = "NA"
        # sentiment dictionary will temporarily store just the counts for each sentiment
        sentiment = {'positive': 0, 'neutral' : 0, 'negative': 0}
        for index, row in df_post[df_post['name'].str.lower() == user.lower()].iterrows():
            # initialize
            if(first_post_date == None):
                first_post_date = row['created_utc']
                first_post_title = row['title']
                first_post_url = row['url']
                pop_post_karma = row['score']
                hated_post_karma = row['score']
                first_post_karma = row['score']
                
            else:
                if(row['created_utc'] < first_post_date):
                    first_post_date = row['created_utc']
                    first_post_title = row['title']
                    first_post_url = row['url']
                    first_post_karma = row['score']
            post_karma += row['score']
            if(row['score'] >= pop_post_karma):
                pop_post_karma = row['score']
                pop_post_title = row['title']
                pop_post_url = row['url']
            if(row['score'] <= hated_post_karma):
                hated_post_karma = row['score']
                hated_post_title = row['title']
                hated_post_url = row['url']
            sentiment[row['sentiment']] += 1
                
        # Get some info on their comments
        first_comment_date = None
        first_comment_body = "NA"
        comment_karma = 0
        pop_comment_karma = 0
        pop_comment_body = "NA"
        hated_comment_karma = 0
        hated_comment_body = "NA"
        for index, row in df_comment[df_comment['name'].str.lower() == user.lower()].iterrows():
            # initialize
            if(first_comment_date == None):
                first_comment_date = row['created_utc']
                first_comment_body = row['body']
                pop_comment_karma = row['score']
                hated_comment_karma = row['score']
                
            else:
                if(row['created_utc'] < first_comment_date):
                    first_comment_date = row['created_utc']
                    first_comment_body = row['body']
            comment_karma += row['score']
            if(row['score'] >= pop_comment_karma):
                pop_comment_karma = row['score']
                pop_comment_body = row['body']
            if(row['score'] <= hated_comment_karma):
                hated_comment_karma = row['score']
                hated_comment_body = row['body']
            sentiment[row['sentiment']] += 1
        
        total_karma = post_karma + comment_karma
        
        # Find what word the user posts most often
        favorite_word = pd.DataFrame(words_frame[words_frame['user'].str.lower() == user.lower()]['word'].value_counts()).reset_index().at[0,'index']
        # Repeat this, but for each part of speech
        favorite_noun = pd.DataFrame(words_frame[(words_frame['user'].str.lower() == user.lower()) &
                                                 (words_frame['pos'].isin(pos_noun))]['word'].value_counts()).reset_index().at[0,'index']
        favorite_adj = pd.DataFrame(words_frame[(words_frame['user'].str.lower() == user.lower()) &
                                                (words_frame['pos'].isin(pos_adj))]['word'].value_counts()).reset_index().at[0,'index']
        favorite_verb = pd.DataFrame(words_frame[(words_frame['user'].str.lower() == user.lower()) &
                                                 words_frame['pos'].isin(pos_verb)]['word'].value_counts()).reset_index().at[0,'index']
        
        # Calculate percentage of posts with each sentiment
        sentiment['positive'] = 100 * (sentiment['positive'] / (posts[1] + comments[1]))
        sentiment['neutral'] = 100 * (sentiment['neutral'] / (posts[1] + comments[1]))
        sentiment['negative'] = 100 * (sentiment['negative'] / (posts[1] + comments[1]))
        
        # create final dictionary
        results = { 'user': user, 'num_posts': posts[1], 'posts_rank': posts[0],
                    'num_comments': comments[1], 'comments_rank': comments[0],
                    'num_words': verbose[1], 'words_rank': verbose[0],
                    'first_post_date_utc': first_post_date,
                    'first_post_date': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(first_post_date)),
                    'first_post_title': first_post_title, 'first_post_url': first_post_url,
                    'first_post_karma': first_post_karma,
                    'umd_post_karma': post_karma, 'pop_post_karma': pop_post_karma, 'pop_post_title': pop_post_title,
                    'pop_post_url': pop_post_url, 'worst_post_title': hated_post_title, 'worst_post_karma': hated_post_karma,
                    'worst_post_url': hated_post_url, 'first_comment_date_utc': first_comment_date,
                    'first_comment_date': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(first_comment_date)),
                    'first_comment_body': first_comment_body, 'umd_comment_karma': comment_karma,
                    'pop_comment_karma': pop_comment_karma, 'pop_comment_body': pop_comment_body,
                    'worst_comment_karma': hated_comment_karma,
                    'worst_comment_body': hated_comment_body, 'total_umd_karma': total_karma, 'favorite_word': favorite_word,
                    'favorite_adj': favorite_adj, 'favorite_verb': favorite_verb, 'favorite_noun': favorite_noun,
                    'sentiment': sentiment}
        return results
    else:
        return None

Now, we can view data for any Reddit user who is part of r/UMD. For example:

In [55]:
vorsteg = analyze_user('vorstegasauras')
dots = analyze_user('UMD_DOTS')
dickerson = analyze_user('ProfJohnDickerson')
miseryy = analyze_user('Miseryy')
print('u/Vorstegasauras\'s most popular post is ', vorsteg['pop_post_title'], 'and you can view it at',
      vorsteg['pop_post_url'] + '.')
print('u/UMD_DOTS\' favorite noun is \'' + dots['favorite_noun'] + '.\'')
print('u/ProfJohnDickerson\'s first comment was made', dickerson['first_comment_date'] + '.')
print(str(miseryy['sentiment']['positive']) + '% of u/Miseryy\'s posts and comments have a positive sentiment.')
u/Vorstegasauras's most popular post is  Tim's Tastings: Date Night and you can view it at https://youtu.be/7rkg4i7pGbI.
u/UMD_DOTS' favorite noun is 'lot.'
u/ProfJohnDickerson's first comment was made 2017-04-02 22:19:40.
76.45186953062849% of u/Miseryy's posts and comments have a positive sentiment.

In [56]:
# Create the plot containing the top words of all types
all_word_counts = words_frame['word'].value_counts()
all_words_fig = go.Figure(
    data=[go.Bar(x=all_word_counts.head(50).index, y=all_word_counts.head(50))],
    layout_title_text="Words with Most Occurrences in r/UMD"
)
all_words_fig.update_xaxes(title_text='Word')
all_words_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
all_words_fig.show()

While the above plot shows the most popular words in r/UMD, "UMD" itself is the only word specific to University of Maryland (or even higher education in general) that appears in this plot. The rest are all quite generic, mostly consisting of articles, prepositions, and pronouns. We will address this, but first, let's take a look at a pattern that arises in the word frequencies themselves.

Zipf's Law

According to Zipf's Law, the number of occurrences of a word in nearly any body of text is inversely proportional to its rank.

Given that word p with rank(p) = 1 has is known to have occ(p) occurrences, a word w would have:

occ(w) ≈ occ(p) * (1 / rank(w))

However, it appears that r/UMD may not actually follow Zipf's law. Looking at the plot above, the second most popular word ("to") should have approximately half as many occurrences as the most popular word ("the"), and the third most popular word ("I") should have approximately one third as many occurrences as "the," and so on. However, we can easily see from the plot that this is not the case, as the next five words after "the" have at least half as many occurences as "the," which is far more for each than they would be predicted to have based on Zipf's law.

Let's investigate this on a larger scale. Using Zipf's law, we will compute the predicted number of occurences for each word based on the number of occurences of the most popular word and each word's ranking. Then, we will visualize how closely r/UMD follows Zipf's law with a plot of the Zipf's law predictions versus the actual number of occurences. If it follows Zipf's law closely, the slope of a linear regression line on the plot will be approximately 1.

In [57]:
# Create dataframe from the series containing the value counts from each word, rename column names appropriately
all_word_counts_frame = pd.DataFrame(all_word_counts).reset_index()
all_word_counts_frame.rename(columns={'index':'word','word':'actual_count'})
# Add column for the predicted occurrence value according to Zipf's law
all_word_counts_frame.insert(2, 'zipf_count', 0)
# Get the count of the number of occurrences of the most popular word
first_count = all_word_counts_frame.iat[0, 1]
for index, row in all_word_counts_frame.iterrows():
    if(index == 0):
        all_word_counts_frame.at[index, 'zipf_count'] = first_count
    else:
        all_word_counts_frame.at[index, 'zipf_count'] = first_count * (1 / (1 + index))

# Create the plot comparing the predicted number of occurrences to the actual number of occurrences for each word.
zipf_fig = px.scatter(
    # Use .head(1500) to limit our plot to the first 1500. Any more than that slows down the notebook too much when viewing.
    x=all_word_counts_frame[all_word_counts_frame.columns[2]].head(1500),
    y=all_word_counts_frame[all_word_counts_frame.columns[1]].head(1500),
    trendline='ols',
    title="Actual Number of Occurences vs. Expected Number of Occurences for Words in r/UMD"
)
zipf_fig.update_yaxes(title_text='Actual Number of Occurences')
zipf_fig.update_xaxes(title_text='Expected Number of Occurences based on Zipf\'s Law')
zipf_fig.show()

Linear Regression Equation: y = 1.476779 * x + 2637.971522

The slope of the regression line in the plot above is 1.476779, which is significantly higher than 1. Looking at the plot, we can see that the slope would be even higher than that if not for the word with rank 1 (the expected number of occurences for this word has to be equal to the actual number of occurences). Thus, it is clear that the actual number of occurences for each word tends to be much higher than the prediction based on Zipf's law, as was seen in the previous bar plot.

Thus, without the inclusion of some sort of additional coeffient(s) in the calculations of the expected number of occurences, r/UMD does not appear follow Zipf's law very closely. It's impossible to be certain about the reason for this, but this may be because r/UMD is an internet forum filled with abbreviations and misspellings, some of which might be on purpose. Grammar is mostly optional for such forums, which may result in certain words that would otherwise have more occurences occurring only a bit more than the word that would be the next most popular word.

We previously defined pos_noun, pos_adj, and pos_verb, each of which contains the codes defined by NLTK that are associated with certain parts-of-speech (nouns, adjectives, and verbs). We will use these in the following plots to see what the most popular words of each category are, and hopefully find some more popular words commonly associated with University of Maryland.

In [58]:
# Create the plot containing the top nouns
noun_counts = words_frame[words_frame['pos'].isin(pos_noun)]['word'].value_counts().head(50)
noun_fig = go.Figure(
    data=[go.Bar(x=noun_counts.index, y=noun_counts)],
    layout_title_text="Nouns with Most Occurrences in r/UMD"
)
noun_fig.update_xaxes(title_text='Noun')
noun_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
noun_fig.show()

From the above plot of the most popular nouns of r/UMD, several words immediately stand out as being specific to University of Maryland, the culture of the subreddit, and higher education in general. Some of these include "class," "UMD," "semester," "campus," "students," "CS," "course," "college," "professor," "room," "math," and "program." Immediately, we can see that this plot is more relevant than the plot with all words included.

In [59]:
# Create the plot containing the top adjectives
adj_counts = words_frame[words_frame['pos'].isin(pos_adj)]['word'].value_counts().head(50)
adj_fig = go.Figure(
    data=[go.Bar(x=adj_counts.index, y=adj_counts)],
    layout_title_text="Adjectives with Most Occurrences in r/UMD"
)
adj_fig.update_xaxes(title_text='Adjective')
adj_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
adj_fig.show()

While the above plot of the most popular adjectives in r/UMD doesn't have as many stand-out UMD-related words at first glance, there are still several words classified as adjectives that are certainly relevant. For example, "major," "umd," and "final" are obvious instances of this. Less obviously, words such as "easy," "difficult," and "hard" are commonly used to described classes or assignments.

We can begin to see some of the limitations of our tokenization and the NLTK's parts-of-speech tokenization in this plot as well, with "words" such as "t" and "it's" being classified as adjectives.

On a lighter note, it is reassuring to see that "good" is the most popular adjective, being approximately three times as popular as "bad." It's always nice to have a positive attitude.

In [60]:
# Create the plot containing the top verbs
verb_counts = words_frame[words_frame['pos'].isin(pos_verb)]['word'].value_counts().head(50)
verb_fig = go.Figure(
    data=[go.Bar(x=verb_counts.index, y=verb_counts)],
    layout_title_text="Verbs with Most Occurrences in r/UMD"
)
verb_fig.update_xaxes(title_text='Verb')
verb_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
verb_fig.show()

The plot of the most popular verbs of r/UMD is much less obviously specific to r/UMD. However, there are certain words that make a lot of sense to appear often in a college forum. For instance, "taking," "take," and "took" (as in "to take a class"), were very popular, as were "work" and "help."

However, we can see some additional issues that have evidently arisen from our tokenization and parts-of-speech classification, specifically in the words "s" and "m."

On a side note, the above graph displays behavior more a bit more in line with what we might expect from Zipf's law, in that the second most popular word is about half as popular as the first.

Overall, of these word-occurrence plots, the plot of the most popular nouns appears to be the most useful due to the quantity of nouns that can easily be directly associated with University of Maryland that appeared in the plot.

Similar to the "analyze_user" function we defined earlier, we will now define a function that, for any word, returns a dictionary containing the following information:

'occurrences' : Total number of occurrences of the word in r/UMD

'occurrences_posts' : Total number of occurrences of the word in titles of posts to r/UMD

'occurrences_descriptions' : Total number of occurrences of the word in descriptions of posts to r/UMD

'occurrences_comments' : Total number of occurrences of the word in comments posted to r/UMD

'pop_source' : The most common source (title of post, post description, or comment) of the word in r/UMD

'first_occurrence_date' : The date and time that the word first appeared on r/UMD

'first_user' : The username of the user who first posted something containing the word to r/UMD

'fave_user' : The username of the user that has used this word the most out of all users in r/UMD

'fave_user_count' : The number of times the user that used this word the most has said the word

'rank_word' : The ranking in popularity for this word in r/UMD

'sentiment' : Gives a dictionary containing the percentages of the occurrences that occurred within posts or comments with certain sentiments. The keys for this dictionary are 'positive', 'neutral', and 'negative'.

In [61]:
# Function that returns dictionary summarizing the data about a particular word's usage in r/UMD
def analyze_word(query_word):
    # Find the number of occurrences of the word and the word's popularity ranking within r/UMD
    rank_word = -1
    occurrences = 0
    occurrences_posts = 0
    occurrences_desc = 0
    occurrences_comments = 0
    pop_source = 'NA'
    first_user = 'NA'
    fave_user = 'NA'
    fave_user_count = -1
    first_occurrence_date = -1
    sentiment = {'positive': 0, 'neutral': 0, 'negative': 0}
    for index, row in pd.DataFrame(words_frame['word'].value_counts()).reset_index().iterrows():
        if(row['index'].lower() == query_word.lower()):
            if(rank_word < 0):
                rank_word = 1 + index
            occurrences += row['word']
    if(rank_word != -1):
        # filter out all the other words that aren't relevant
        query_words_frame = words_frame[words_frame['word'].str.lower() == query_word.lower()]
        
        # Make dataframe witht the counts for the number of times the word came from a specific source
        query_words_source_frame = pd.DataFrame(query_words_frame['source'].value_counts()).reset_index()
        
        # Find the number of occurrences of the word from each source
        for index, row in query_words_source_frame.iterrows():
            if(row[0] == 'title'):
                occurrences_posts = row['source']
            elif(row[0] == 'description'):
                occurrences_desc = row['source']
            else:
                occurrences_comments = row['source']
                
        # Determine the most popular source of the word
        pop_source = query_words_source_frame.iat[0,0]
            
        # Find the date of the first occurrence of the word, the first user to say it,
        # and count the number of times the word appeared in text classified to have each sentiment
        for index, row in query_words_frame.iterrows():
            if(first_occurrence_date == -1):
                first_occurrence_date = row['date']
                first_user = row['user']
            elif(first_occurrence_date > row['date']):
                first_occurrence_date = row['date']
                first_user = row['user']
            # increment the sentiment count for the type of sentiment the word's associated text has
            sentiment[row['sentiment']] += 1
        
        # Determine which user has said the word the most and how many times they've said it
        users_counts = pd.DataFrame(query_words_frame['user'].value_counts()).reset_index()
        if(users_counts.at[0, 'index'] != 'None'):
            fave_user = users_counts.iat[0, 0]
            fave_user_count = users_counts.iat[0, 1]
        else:
            fave_user = users_counts.iat[1, 0]
            fave_user_count = users_counts.iat[1, 1]
            
        # Convert sentiments to percentages
        sentiment['positive'] = 100 * (sentiment['positive'] / occurrences)
        sentiment['neutral'] = 100 * (sentiment['neutral'] / occurrences)
        sentiment['negative'] = 100 * (sentiment['negative'] / occurrences)
            
    # Create dictionary of results
    results = {'rank_word' : rank_word, 'occurrences' : occurrences, 'occurrences_posts' : occurrences_posts,
                'occurrences_descriptions' : occurrences_desc, 'occurrences_comments' : occurrences_comments,
                'pop_source' : pop_source, 'first_occurrence_date_utc': first_occurrence_date,
                'first_occurrence_date': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(first_occurrence_date)),
                'first_user': first_user, 'fave_user': fave_user, 'fave_user_count' : fave_user_count,
                'sentiment': sentiment}
    return results

Now we can find data on any word's presence in r/UMD:

In [62]:
umd = analyze_word('umd')
maryland = analyze_word('maryland')
loh = analyze_word('loh')
tastings = analyze_word('tastings')
tendies = analyze_word('tendies')
dickerson = analyze_word('Dickerson')
print('\"UMD\" is ranked', umd['rank_word'], 'in terms of popularity, with', umd['occurrences'], 'occurrences.')
print('\"Maryland\" is ranked', maryland['rank_word'], 'in terms of popularity, with', maryland['occurrences'], 'occurrences.')
print('\"Loh\" was first mentioned', loh['first_occurrence_date'], 'by u/' + loh['first_user'] + '.')
print('\"Tastings\" has been mentioned the most by u/' + tastings['fave_user'] + ',', tastings['fave_user_count'], 'times.')
print('\"Tendies\" have been mentioned', tendies['occurrences'], 'times since their first mention',
      tendies['first_occurrence_date'] + '.')
print(str(dickerson['sentiment']['positive']) + '% of the mentions of \"Dickerson\" in r/UMD have had a positive sentiment.')
"UMD" is ranked 45 in terms of popularity, with 28953 occurrences.
"Maryland" is ranked 252 in terms of popularity, with 4931 occurrences.
"Loh" was first mentioned 2011-04-04 00:15:45 by u/o8643.
"Tastings" has been mentioned the most by u/Vorstegasauras, 14 times.
"Tendies" have been mentioned 118 times since their first mention 2015-04-12 01:28:45.
68.75% of the mentions of "Dickerson" in r/UMD have had a positive sentiment.

Frequency and Sentiment of Word Occurences Over Time

We'll now define two functions: one that displays a graph of any word's frequency of occurrences over time, and another that displays a graph of any word's sentiment over time. The sentiment graphs will feature a LOWESS (Locally Weighted Scatterplot Smoothing) curve to make it easier to visualize the sentiment over time.

In [63]:
def word_time_plot(query_word):
    # Filter out all words except for the query word
    query_frame = words_frame[words_frame['word'] == query_word.lower()]
    # Get the number of times the word was posted each day
    date_counts = query_frame['date_ymd'].value_counts()
    # Make the plot
    query_fig = go.Figure(
        # For x, we are converting from Unix time to the datetime object so that the plot is meaningful
        # (we don't want to display our data in terms of seconds since 1970)
        data=[go.Bar(x=pd.DataFrame(date_counts).reset_index()['index'].apply(lambda x: dt.datetime.fromtimestamp(x)),
                     y=date_counts)],
        # Add a title
        layout_title_text='Occurrences of \"' + str(query_word) + '\" vs. Time',
        # Change plot background color to black
        # (we need to do this because the default light gray background color makes the bars nearly invisible when zoomed out)
        layout_plot_bgcolor='rgb(0,0,0)'
    )
    # Label the axes
    query_fig.update_xaxes(title_text = 'Time')
    query_fig.update_yaxes(title_text = 'Occurrences')
    return query_fig

def sentiment_time_plot(query_word):
    # Filter out all words except for the query word
    query_frame = words_frame[words_frame['word'] == query_word.lower()]
    # Make the plot
    query_fig = px.scatter(
        # For x, we are converting from Unix time to the datetime object so that the plot is meaningful
        # (we don't want to display our data in terms of seconds since 1970)
        x=query_frame['date'].apply(lambda x: dt.datetime.fromtimestamp(x)),
        y=query_frame['sentiment'].apply(lambda x: 1 if x == 'positive' else 0 if x == 'negative' else -1),
        # Add a linear regression line
        trendline="lowess",
        # Add a title
        title='Sentiment of \"' + str(query_word) + '\" vs. Time',
    )
    # Label the axes
    query_fig.update_xaxes(title_text = 'Time')
    query_fig.update_yaxes(title_text = 'Sentiment',
                           ticktext=["Negative", "Neutral", "Positive"],
                           tickvals=[-1, 0, 1])
    return query_fig

Using these functions, we can see the occurrences and sentiment of any word over time. Let's take a look at a few examples.

To begin, let's look at the plot of the occurrences of "snow" over time.

In [64]:
word_time_plot('snow').show()

As is evident from the plot, there tends to be more mentions of "snow" during the winter months for each year. This is unsurprising and uninteresting, but it is a good indicator that our word_time_plot function is functional.

Let's take a look at the sentiment of the posts and comments containing "snow" over time.

In [65]:
sentiment_time_plot('snow').show()

From the LOWESS curve, it appears that the sentiment associated with "snow" tends to be more positive than negative. Given that adequate snowfall can result in cancelled classes, this is unsurprising. It also appears from the relatively flat curve that the general sentiment concerning snow has not changed much year-to-year.

Next, we'll examine "Eduroam" (UMD's current main campus Wi-Fi network).

This has been a controversial topic ever since University of Maryland switched from UMD-Secure to Eduroam as the primary source of Wi-Fi.

In [66]:
# Produce graph of Eduroam's appearances over time
word_time_plot('Eduroam').show()
# Find out which user has mentioned Eduroam the most
eduroam_analysis = analyze_word('Eduroam')
print('u/' + eduroam_analysis['fave_user'], 'has mentioned Eduroam the most out of everyone on r/UMD, with a total of',
     eduroam_analysis['fave_user_count'], 'mentions.')
u/umdit has mentioned Eduroam the most out of everyone on r/UMD, with a total of 93 mentions.

As can be seen in the above graph, before August of 2019, there were very few mentions of Eduroam. Beginning in August 2019 and continuing into the following months, however, "Eduroam" began to appear much more frequently, even appearing as high as 31 times on September 19, 2019. This rise in occurrences coincides with the start of the fall 2019 semester, the first semester in which Eduroam fully replaced UMD-Secure. Many users of Eduroam reported having more connectivity problems than they had with UMD-Secure, and such problems were discussed on r/UMD. Often, u/umdit, the official Reddit account for UMD's IT department, would be involved with discussions of the issues with other users, trying to help problem solve and addressing misconceptions (for example: "Eduroam uses the same infrastructure as umd-secure. We know you're having issues, let us help!"). Accordingly, u/umdit is the user that has mentioned "Eduroam" the most in r/UMD, a total of 93 times, as calculated by the analyze_word function we defined earlier.

The large spike in mentions of "Eduroam" occurring around September 19, 2019, is likely partially a result of the protest that occurred on September 17, 2019. In the following days, there were several posts that mocked the protesters' signs by editing photos of them so that they protested Eduroam, which then likely contributed to a new wave of posts complaining about the quality of Eduroam (and likely inspired posts such as this).

Let's now look at how the sentiment of the posts and comments containing "Eduroam" changes over time.

In [67]:
sentiment_time_plot('Eduroam').show()

The LOWESS curve shows us that r/UMD's opinion of Eduroam began dropping dramatically since the beginning of the fall 2019 semester. This makes sense given the many posts about problems with Eduroam that have appeared in r/UMD since it became the primary campus Wi-Fi network for the fall of 2019, as previously discussed.

Next, let's take a look at the appearances of "Durkin," as in DJ Durkin, the former coach of UMD's football team.

Coach Durkin was embroiled in controversy following the death of UMD football player Jordan McNair in June 2018.

In [68]:
word_time_plot('Durkin').show()

There are several things to note about this plot of the occurrences of "Durkin." First, in the far left portion of the plot, there are two occurrences in December of 2015. This coincides with the initial hire of DJ Durkin as head coach of University of Maryland's football team, which was first reported on December 2, 2015.

The next relatively large cluster of "Durkin" mentions occurred between August 11 and August 19, 2018. This coincides with the time that Durkin was first placed on leave from his position as head coach of the football team, and there was much discussion on r/UMD about this. Interestingly enough, this is the first occasion that Durkin was mentioned in r/UMD after Jordan McNair's death on June 13, 2018, likely indicating that users of r/UMD did not initially consider Durkin to be at fault when news of McNair's death was first spread.

Finally, the largest spike occurred between October 29, 2019 and November 2, 2019. On October 30, the Board of Regents reinstated Durkin as head coach of the football team. On October 31, the very next day, President Loh fired DJ Durkin. Following that, on November 1, a fight broke out at the UMD football practice. All of these events were discussed on r/UMD as they were reported:

Durkin reinstated: "After Maryland Player’s Death, Coach and Athletic Director Keep Their Jobs - The New York Times"

Durkin fired: "Durkin Fired"

Football Fight: "Fight breaks out among Maryland football players at practice in wake of Durkin drama"

We'll do a plot of the appearances of "Loh" (as in President Wallace Loh) over time to see if his name has similar spikes as Durkin's name.

In [69]:
word_time_plot('Loh').show()

Loh has clearly been mentioned overall far more than DJ Durkin (which makes sense given that he is the president of the university), however, Loh shares the same spikes in mentions as Durkin in August 2018 and late October/early November of 2018. This makes sense, as Loh was the one to ultimately fire Durkin, and Loh's retirement was announced on the same day that Durkin was initially reinstated (October 30, 2019).

Let's now take a brief look at the sentiment for Durkin and Loh over time, starting with Durkin.

In [70]:
sentiment_time_plot('Durkin').show()

Although there were very few mentions of Durkin before August 2018, there is a significant noticable drop in sentiment level that takes place around August to November 2018. This makes sense given that this is the timeframe during which Durkin was placed on leave, reinstated, and then suspended.

In [71]:
sentiment_time_plot('Loh').show()

The sentiment of posts and comments containing mentions of "Loh" appears to have stayed approximately the same throughout r/UMD's existence, as the LOWESS curve nearly has a flat slope. Being the president of the university, Loh has always had and always will have supporters and critics. If we look closely, however, the slope of the trendline is ever-so-slightly negative, which may be partially a result of the controversy following McNair's death. The trendline overall appears to be much more positive than negative, but this may be able to be attributed to the VADER SentimentIntensityAnalyzer misclassifying sarcastic posts and comments as positive ones.

We'll now examine the occurrences of "Penn" (as in "Penn State") over time.

In [72]:
word_time_plot('Penn').show()

In examining the above plot, we can see that there have been various mentions of "Penn" (likely referring to Penn State) over the years, popping up now and then normally around when UMD and Penn State played each other in football (for example, October 24, 2015, October 8, 2016, etc.). However, in 2019, there was a major spike in mentions of "Penn" in September. This coincides with the major Maryland vs. Penn State home game that took place on Friday, September 27. The University had said that classes would not meet in person during the afternoon of the game, and in the weeks leading up to it, r/UMD was innundated with posts about obtaining and selling tickets to the game (for example, "Anyone wanna sell me a Penn State ticket", and "NEED PENN STATE TICKET TRYING TO BUY ONE BEFORE TODAY ENDS").

Following Maryland's 59-0 defeat during the game, several users posted memes about the loss: "UMD vs Penn State: A Halftime Report", "UMD Cheerleaders when a touchdown to put Penn State up 45 gets called back and we are only down 38"

Let's take a look at the sentiment of posts and comments mentioning "Penn" over time.

In [73]:
sentiment_time_plot('Penn').show()

This negatively-sloping LOWESS curve can easily be explained by an increase in posts insulting Penn State surrounding sporting events. For instance, see this post, as well as this post. Again, the trendline overall seems to be more positive than negative, which may result from the influence of sarcasm.

Finally, let's examine mentions of "Iribe" over time.

"Iribe" refers to Brendan Iribe, co-founder of Oculus VR and the namesake of the Iribe Center for Computer Science and Engineering.

In [74]:
word_time_plot('Iribe').show()

It is immediately clear from this plot that mentions of "Iribe" have dramatically increased in frequency since the first mention in 2014. This corresponds with the completion of the construction of the Iribe Center. Mentions of Iribe begin to increase dramatically throughout the first half of 2019, when portions of the building first began opening. This is also when the grand opening of the Iribe Center occurred, attended by Brendan Iribe himself on April 27, 2019. Following May 31, there are almost no mentions of Iribe until mid August (which corresponds to summer break, when relatively few people were on campus). Following that, a major influx of Iribe mentions comes during the fall 2019 semester as the building opened to classes completely.

Another interesting aspect of this plot is Iribe's first mention in r/UMD, on April 2, 2014. One might expect that this corresponds to the announcement of Brendan Iribe's donation to build the Iribe Center, however, that did not take place until September 11, 2014. Strangely enough, it appears that Iribe's donation was completely ignored by r/UMD when it was first announced, as there were no mentions of Iribe in September of 2014.

Let's dig a bit further using our analyze_word function on "Iribe" to find out who made the first post to mention the name:

In [75]:
print('First user to mention \"Iribe\": u/' + analyze_word('Iribe')['first_user'])
First user to mention "Iribe": u/SyntheticBiology

Let's now take a look at the first post by the user named "SyntheticBiology" and see if this is also the first mention of Iribe in r/UMD:

In [76]:
synth_bio = analyze_user('SyntheticBiology')
print('u/SyntheticBiology\'s first post on r/UMD was \"' + synth_bio['first_post_title'] + '.\"')
print('Its URL is', synth_bio['first_post_url'] + '.')
print('It received',synth_bio['first_post_karma'],'upvotes.')
u/SyntheticBiology's first post on r/UMD was "Brendan Iribe and Michael Antonov from Oculus VR are giving a talk on campus this Friday (Apr 4)."
Its URL is http://cmns.umd.edu/news-events/events/2010.
It received 3 upvotes.

Well, it's our lucky day! We've found the first post containing a mention of Iribe in r/UMD. While the URL that was posted appears to be a dead link, the post's title is enough to tell us that the post was about Iribe and Antonov coming to give a talk on April 4, 2014. Also occurring on April 4, 2014 was the start of the first Bitcamp, University of Maryland's largest hackathon, which was kicked off with a keynote speech by Iribe and Antonov, so it is likely that this post was referring to this speech at Bitcamp.

As described in this Washington Post article:

On a recent visit to U-Md. — where Iribe first met his business partner, Michael Antonov, in a freshman dorm in 1998 — the 35-year-old Californian attended a school-sponsored “hackathon,” in which students use technology to solve a problem in a short amount of time. He met with professors and spoke to hundreds of students, impressed with their energy. But walking into the computer science center on campus, he said he found the facility “depressing” and “a lot worse than I remembered it.” ("Brendan Iribe, co-founder of Oculus VR, makes record $31 million donation to U-Md.", by Nick Anderson)

This article states that Iribe was inspired to make the massive donation required to build the Iribe Center during a disappointing stop in one of the computer science buildings (likely A.V. Williams) while visiting UMD to speak at a hackathon. The hackathon referred to by the article had to be Bitcamp 2014. Because of this immensely important visit, today University of Maryland has the massive state-of-the-art building within which our CMSC320 lectures take place.

It's funny to think about the Iribe Center's very existence all started with a simple campus visit in April 2014 that was almost entirely ignored (having received a meager three upvotes) when first posted about to r/UMD. Not even u/SyntheticBiology could have predicted the far-reaching effects that visit would have at University of Maryland in the years to come.

Finally, let's take a look at the trend in sentiment for mentions of "Iribe."

In [77]:
sentiment_time_plot('Iribe').show()

While the general sentiment surrounding Iribe seems to be overwhelmingly positive, there is a noticable waver in the LOWESS curve beginning in mid-2018. This may be a result of controversies and complaints surrounding the new building, such as noise complaints, problems with the doors, complaints about roof access, and complaints about the building's usage, thrown in with a variety of more positive posts and comments.

Final Fun Facts

Let's do a few final calculations just for fun.

Most upvoted posts in r/UMD:

We'll just sort our df_post dataframe by the score (number of upvotes) to find the top ten most popular posts.

In [78]:
place = 1
# Sort by the number of upvotes
for index, row in df_post.sort_values(by='score', ascending=False).head(10).iterrows():
    print(str(place) + '.')
    print('Post Title: \"' + row['title'] + '\"')
    print('User: u/' + row['name'])
    print('Score:', row['score'])
    print('URL:', row['url'])
    print('Date Posted:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['created_utc'])))
    print(header_str, header_str)
    place += 1
1.
Post Title: "had to sacrifice my best friend to testudo :("
User: u/hunterm19
Score: 730
URL: https://v.redd.it/p4siu7px9lz21
Date Posted: 2019-05-21 16:14:27
~~~~~~~~~~ ~~~~~~~~~~
2.
Post Title: "The Beekeeping Club is having a bake sale"
User: u/lukedestroyer12
Score: 659
URL: https://i.redd.it/jyffhw5kw6n31.jpg
Date Posted: 2019-09-17 17:32:42
~~~~~~~~~~ ~~~~~~~~~~
3.
Post Title: "The duality of man"
User: u/kapperstick
Score: 658
URL: https://i.redd.it/tahhk3xv40r31.jpg
Date Posted: 2019-10-06 23:03:23
~~~~~~~~~~ ~~~~~~~~~~
4.
Post Title: "This is a very rare Frosty Testudo. Donate upvotes to Frosty Testudo to end Maryland's heatwave."
User: u/PoshLagoon
Score: 654
URL: https://i.redd.it/rxrap93nixi11.jpg
Date Posted: 2018-08-29 00:46:47
~~~~~~~~~~ ~~~~~~~~~~
5.
Post Title: "1 Upvote = 1 Prayer for School Closure"
User: u/Snowmuhgeddon
Score: 572
URL: https://www.reddit.com/r/UMD/comments/asfk6x/1_upvote_1_prayer_for_school_closure/
Date Posted: 2019-02-19 21:19:23
~~~~~~~~~~ ~~~~~~~~~~
6.
Post Title: "Welcome to your UMD orientation! Here in text format, for those that can't attend physically."
User: u/non_troppo
Score: 529
URL: https://www.reddit.com/r/UMD/comments/cectuv/welcome_to_your_umd_orientation_here_in_text/
Date Posted: 2019-07-17 13:26:59
~~~~~~~~~~ ~~~~~~~~~~
7.
Post Title: "I prefer the real Maryland"
User: u/Ares__
Score: 522
URL: https://imgur.com/HMQiY5o
Date Posted: 2019-09-28 18:24:47
~~~~~~~~~~ ~~~~~~~~~~
8.
Post Title: "Watch out if you're subleasing in the Varsity! Legitimate craziest thing to every happen to me."
User: u/bnakebnake
Score: 518
URL: https://www.reddit.com/r/UMD/comments/buoiz4/watch_out_if_youre_subleasing_in_the_varsity/
Date Posted: 2019-05-30 03:48:11
~~~~~~~~~~ ~~~~~~~~~~
9.
Post Title: "How the Beekeeping club looks having a bake sale in the middle of the chaos"
User: u/PoshLagoon
Score: 515
URL: https://i.redd.it/qz2d1x3k37n31.png
Date Posted: 2019-09-17 18:12:22
~~~~~~~~~~ ~~~~~~~~~~
10.
Post Title: "Upvote if you want the new Computer Science building cafe to be named Snack Overflow Cafe!"
User: u/SnackOverflowCafe
Score: 510
URL: https://www.reddit.com/r/UMD/comments/a6ko2n/upvote_if_you_want_the_new_computer_science/
Date Posted: 2018-12-16 00:50:04
~~~~~~~~~~ ~~~~~~~~~~

Most Upvoted/Downvoted Comments in r/UMD:

The processes for finding the most upvoted posts and the least upvoted posts are identical, so we'll just define a function that can do both depending on the boolean that is passed into it.

In [79]:
# If worst is True, find the most downvoted comments, otherwise find the most upvoted comments
def best_worst_comments(worst):
    if(worst == False):
        print('Top ten most upvoted comments in r/UMD:')
    else:
        print('Top ten most downvoted comments in r/UMD:')
    place = 1
    # Sort by the number of upvotes
    for index, row in df_comment.sort_values(by='score', ascending=worst).head(10).iterrows():

        # We don't have URLs that link directly to the comments in this data, so we'll find the post.
        parent_id = row['parent_id']
        parent_url = 'Not Available'
        # If the comment is a reply to another comment, we'll need to walk back up the chain until the parent is a post.
        while(parent_id in df_comment['id']):
            # find the index of the parent comment and use that to get the next parent
            for index, item in df_comment['id'].iteritems():
                if(item == parent_id):
                    parent_id = df_comment.iat[index, 4]

        # the parent_id must refer to a post by the end of the while loop, so we'll get the URL from df_post
        for index, item in df_post['id'].iteritems():
            # Each ID from the comments has a 'tX_' prefix, so we'll check for a substring to ignore this part
            if(item in parent_id):
                parent_url = df_post.iat[index, 2]

        print(str(place) + '.')
        print('Comment: \"' + row['body'] + '\"')
        print('User: u/' + row['name'])
        print('Score:', row['score'])
        print('Date Posted:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['created_utc'])))
        print('Parent Post:', parent_url)
        print(header_str, header_str)
        place += 1

Most Upvoted Comments in r/UMD:

In [80]:
# Call our function to get the most upvoted comments in r/UMD
best_worst_comments(False)
Top ten most upvoted comments in r/UMD:
1.
Comment: "Girl who tried to break it up was the real winner of this fight"
User: u/Kogie13
Score: 310
Date Posted: 2019-03-08 23:52:29
Parent Post: https://v.redd.it/guva156hczk21
~~~~~~~~~~ ~~~~~~~~~~
2.
Comment: "ok you need to stop procrastinating "
User: u/birdwordwords
Score: 289
Date Posted: 2018-11-15 00:30:33
Parent Post: https://www.reddit.com/r/UMD/comments/9x5rre/i_got_caught/
~~~~~~~~~~ ~~~~~~~~~~
3.
Comment: "sofloantonio"
User: u/shelled15
Score: 282
Date Posted: 2015-11-12 22:32:05
Parent Post: https://www.reddit.com/r/UMD/comments/3sl6wr/what_are_the_worst_things_about_umd/
~~~~~~~~~~ ~~~~~~~~~~
4.
Comment: "You should absolutely appeal this.  It's still 11:55 until it's 11:56, and they gave you the ticket while it was still 11:55."
User: u/subterraniac
Score: 260
Date Posted: 2019-03-11 23:49:47
Parent Post: https://i.redd.it/mibejk3k7kl21.jpg
~~~~~~~~~~ ~~~~~~~~~~
5.
Comment: "Buckle up for life in prison buddy"
User: u/chair96
Score: 257
Date Posted: 2018-04-29 00:51:11
Parent Post: https://www.reddit.com/r/UMD/comments/8fo1o3/caught_smoking_pot_on_campus_dorm/
~~~~~~~~~~ ~~~~~~~~~~
6.
Comment: "Fuck or get fucked boi welcome to PG County."
User: u/Shadow196840
Score: 247
Date Posted: 2019-10-06 15:31:55
Parent Post: https://www.reddit.com/r/UMD/comments/de4dow/state_of_college_park/
~~~~~~~~~~ ~~~~~~~~~~
7.
Comment: "A bit obnoxious but we think it was good for business. It was our spot first so we didn't leave

We're still up until 4PM if anyone wants some bee products or baked goods. 

Be positive and support bees 🐝🐝🐝 👍"
User: u/UMDbees
Score: 243
Date Posted: 2019-09-17 18:05:43
Parent Post: Not Available
~~~~~~~~~~ ~~~~~~~~~~
8.
Comment: "Can you please not? Whatever's happening that makes you feel this way I promise will get better. At the very least, you gain nothing from doing this, you only hurt others.

Also if you're a troll, this is pretty sick."
User: u/MontereyJack144
Score: 240
Date Posted: 2012-03-11 02:30:37
Parent Post: https://www.reddit.com/r/UMD/comments/qr0kw/im_thinking_about_going_on_a_shooting_rampage_all/
~~~~~~~~~~ ~~~~~~~~~~
9.
Comment: "can we take a moment to appreciate the fact that this man made an account for this question"
User: u/Pwnemon
Score: 239
Date Posted: 2016-04-18 00:02:10
Parent Post: https://www.reddit.com/r/UMD/comments/4f90xw/cows_at_umd/
~~~~~~~~~~ ~~~~~~~~~~
10.
Comment: "We're on it."
User: u/umd_security
Score: 233
Date Posted: 2017-10-18 20:34:25
Parent Post: https://www.reddit.com/r/UMD/comments/779dpq/people_throwing_eggs_at_me/
~~~~~~~~~~ ~~~~~~~~~~

Most Downvoted Comments in r/UMD:

In [81]:
# Call our function to get the most downvoted comments in r/UMD
best_worst_comments(True)
Top ten most downvoted comments in r/UMD:
1.
Comment: "Just signed up.. "
User: u/cms2337
Score: -394
Date Posted: 2014-01-21 18:35:29
Parent Post: https://www.coursera.org/course/android?from_restricted_preview=1&course_id=971246&r=https%3A%2F%2Fclass.coursera.org%2Fandroid-001%2Fclass
~~~~~~~~~~ ~~~~~~~~~~
2.
Comment: "You sound fun at parties... "
User: u/als7798
Score: -175
Date Posted: 2018-03-27 19:43:01
Parent Post: https://www.reddit.com/r/UMD/comments/87l1xv/dear_guy_who_hit_a_parked_car_in_mowatt_lane/
~~~~~~~~~~ ~~~~~~~~~~
3.
Comment: "It's Saturday. "
User: u/None
Score: -106
Date Posted: 2017-12-09 15:39:00
Parent Post: Not Available
~~~~~~~~~~ ~~~~~~~~~~
4.
Comment: "I don’t go to UMD"
User: u/Souflay_Boi
Score: -104
Date Posted: 2019-03-06 15:30:45
Parent Post: https://www.reddit.com/r/UMD/comments/axytyy/stay_warm_terps/
~~~~~~~~~~ ~~~~~~~~~~
5.
Comment: "Actually, Vector isn't a scam as you would see if you bothered to do 5 seconds of research. I have worked with Vector for over 2 years and am 100% satisfied with all aspects of my job, especially pay and flexibility.

There are a lot of rumors and myths that claim Vector Marketing is a scam, preying on naïve, but ambitious college students. But this is false and ridiculous as anyone with half a brain can deduce.

>Vector is a well-known MLM company, basically a legal pyramid scheme.

Wrong. Vector Marketing is not a pyramid scheme in any way, shape or form. Vector Marketing is the sales and marketing division of Cutco. Vector reps are not responsible for recruiting new reps or buying any sort of product or service. In fact, Vector reps are independent contractors and they set their own schedules and have the opportunity to control how much they earn through a guaranteed base pay and commissions earned on each sale. Vector Marketing is also not a "get-rich-quick" scheme. Success is not guaranteed and it may take hard-work and dedication in order to succeed as a Vector rep.

Please read https://vectormarketing.com/vector-truth/ to educate yourself."
User: u/VectorMarketingRep
Score: -102
Date Posted: 2017-12-26 01:07:30
Parent Post: Not Available
~~~~~~~~~~ ~~~~~~~~~~
6.
Comment: "Ok I also know Gio and you're situation may be entirely truthful, but that doesn't give you the right whatsoever to slander someone by name like this. If you really have a problem, you would've complained or reported to the school instead of directly calling someone out on a public forum like a bitch. This article had nothing to do about you or your fucking sob story, this has to do with academics so nothing of what you said should be used against him in this case. Slandering someone like this by name is even worse and proves you are what's wrong with this planet, constantly using your own selfish gains in a cutthroat manner to slander someone else. Even though this article may be subjective, check yourself. This has nothing to do with you so stop bringing up extreme points for absolutely no reason that has nothing to do with the original post. "
User: u/xxYungHuarachexx
Score: -93
Date Posted: 2017-05-21 22:21:58
Parent Post: Not Available
~~~~~~~~~~ ~~~~~~~~~~
7.
Comment: "Mmmmm comp sci majors sitting on a fat stack of cash after college with companies waiting in line to hire us. Enjoy your ass kissing to even get a job."
User: u/Meditos
Score: -93
Date Posted: 2019-05-09 19:17:08
Parent Post: https://i.redd.it/klish23aa8x21.png
~~~~~~~~~~ ~~~~~~~~~~
8.
Comment: "It isn't like the STEM students are going to go to the game so a wise decision to keep the labs open.  On a plus note, catering to the sports fans is still better than catering to the SJWs."
User: u/TheyTheirsThem
Score: -91
Date Posted: 2019-07-22 15:57:41
Parent Post: Not Available
~~~~~~~~~~ ~~~~~~~~~~
9.
Comment: "another reason why sports are shit"
User: u/tittie_terp
Score: -90
Date Posted: 2017-11-13 17:44:03
Parent Post: https://www.reddit.com/r/CollegeBasketball/comments/7cnbrh/missed_connection_maryland_student_rejected_on/
~~~~~~~~~~ ~~~~~~~~~~
10.
Comment: "Are there classes on Saturday?"
User: u/None
Score: -89
Date Posted: 2017-12-09 16:01:18
Parent Post: Not Available
~~~~~~~~~~ ~~~~~~~~~~

First Posts Ever Made in r/UMD

Finally, let's go back to the genesis of r/UMD, and see what was going on in 2010.

In [13]:
place = 1
# Sort by the date
for index, row in df_post.sort_values(by='created_utc', ascending=True).head(10).iterrows():
    print(str(place) + '.')
    print('Post Title: \"' + row['title'] + '\"')
    print('User: u/' + row['name'])
    print('Score:', row['score'])
    print('URL:', row['url'])
    print('Date Posted:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['created_utc'])))
    print('~~~~~~~~~~ ~~~~~~~~~~')
    place += 1
1.
Post Title: "The Camera that Sees Sound!"
User: u/unwavering
Score: 3
URL: http://www.umiacs.umd.edu/~odonovan/Audio_Camera/
Date Posted: 2010-06-25 04:16:39
~~~~~~~~~~ ~~~~~~~~~~
2.
Post Title: "Poop? In my McKeldin?"
User: u/None
Score: 1
URL: https://www.reddit.com/r/UMD/comments/cj526/poop_in_my_mckeldin/
Date Posted: 2010-06-26 01:13:33
~~~~~~~~~~ ~~~~~~~~~~
3.
Post Title: "Poop? In my McKeldin?"
User: u/chrisg90
Score: 6
URL: https://www.reddit.com/r/UMD/comments/cj564/poop_in_my_mckeldin/
Date Posted: 2010-06-26 01:27:48
~~~~~~~~~~ ~~~~~~~~~~
4.
Post Title: "Welcome to the UMD subreddit!"
User: u/maxpericulosus
Score: 4
URL: https://www.reddit.com/r/UMD/comments/cj6m4/welcome_to_the_umd_subreddit/
Date Posted: 2010-06-26 04:45:04
~~~~~~~~~~ ~~~~~~~~~~
5.
Post Title: "All of you people that are ahead of me on the wait list, I'm going to need you to go ahead and back out so I can get in my class - Thanks"
User: u/Ares__
Score: 3
URL: https://www.reddit.com/r/UMD/comments/cscgl/all_of_you_people_that_are_ahead_of_me_on_the/
Date Posted: 2010-07-22 05:44:03
~~~~~~~~~~ ~~~~~~~~~~
6.
Post Title: "Easy Electives?"
User: u/None
Score: 3
URL: https://www.reddit.com/r/UMD/comments/csivu/easy_electives/
Date Posted: 2010-07-22 16:41:26
~~~~~~~~~~ ~~~~~~~~~~
7.
Post Title: "What is you major and year?"
User: u/Ares__
Score: 13
URL: https://www.reddit.com/r/UMD/comments/csow3/what_is_you_major_and_year/
Date Posted: 2010-07-23 01:20:56
~~~~~~~~~~ ~~~~~~~~~~
8.
Post Title: "What groups/clubs do you belong to?"
User: u/maxpericulosus
Score: 4
URL: https://www.reddit.com/r/UMD/comments/csp9r/what_groupsclubs_do_you_belong_to/
Date Posted: 2010-07-23 02:01:05
~~~~~~~~~~ ~~~~~~~~~~
9.
Post Title: "Since there seems to be a few terp redditors, would you all want to have a meet up in the fall?"
User: u/Ares__
Score: 7
URL: https://www.reddit.com/r/UMD/comments/ct0b1/since_there_seems_to_be_a_few_terp_redditors/
Date Posted: 2010-07-23 20:02:10
~~~~~~~~~~ ~~~~~~~~~~
10.
Post Title: "Thoughts on the closing of Campus Drive?"
User: u/None
Score: 3
URL: https://www.reddit.com/r/UMD/comments/ctmjg/thoughts_on_the_closing_of_campus_drive/
Date Posted: 2010-07-26 00:57:28
~~~~~~~~~~ ~~~~~~~~~~

It is interesting to note that while r/UMD itself was created on April 15, 2010, it appears that the first post was not made until June 25, 2010.